CONTROL OF GENERALIZED ERROR RATES IN MULTIPLE TESTING

By Joseph P. Romano and Michael Wolf
Stanford University and University of Zurich 

Consider the problem of testing s hypotheses simultaneously. 
The usual approach restricts attention to procedures that control 
the probability of even one false rejection, the familywise error rate 
(FWER). If s is large, one might be willing to tolerate more than 
one false rejection, thereby increasing the ability of the procedure 
to correctly reject false null hypotheses. One possibility is to replace 
control of the FWER by control of the probability of k or more false
rejections, which is called the k-FWER. We derive both single-step
and step-down procedures that control the k-FWER in finite samples
or asymptotically, depending on the situation. We also consider
the false discovery proportion (FDP), defined as the number of false
rejections divided by the total number of rejections (and defined to
be 0 if there are no rejections). The false discovery rate proposed by
Benjamini and Hochberg [J. Roy. Statist. Soc. Ser. B 57 (1995) 289-300]
controls E(FDP). Here, the goal is to construct methods which
satisfy, for a given $\gamma$ and $\alpha$, $P\{\mathrm{FDP} > \gamma\} \le \alpha$, at least asymptotically.
In contrast to the proposals of Lehmann and Romano [Ann. Statist. 
33 (2005) 1138-1154], we construct methods that implicitly take into 
account the dependence structure of the individual test statistics in 
order to further increase the ability to detect false null hypotheses. 
This feature is also shared by related work of van der Laan, Du- 
doit and Pollard [Stat. Appl. Genet. Mol. Biol. 3 (2004) article 15], 
but our methodology is quite different. Like the work of Pollard and 
van der Laan [Proc. 2003 International Multi-Conference in Com- 
puter Science and Engineering, METMBS'03 Conference (2003) 3-9] 
and Dudoit, van der Laan and Pollard [Stat. Appl. Genet. Mol. Biol. 
3 (2004) article 13], we employ resampling methods to achieve our 
goals. Some simulations compare finite sample performance to currently
available methods.

1. Introduction. The main goal of this paper is to show how computer- 
intensive methods can be used to construct asymptotically valid tests of 
multiple hypotheses under very weak conditions. In particular, we construct 
computationally feasible methods which provide control (at least asymptot- 
ically) of some generalized notions of the familywise error rate. However, the 
theory also applies to exact finite sample control in certain situations. 

Consider the problem of testing hypotheses $H_1, \ldots, H_s$. A classical
approach to dealing with the multiplicity problem is to restrict attention to
procedures that control the probability of one or more false rejections, which
is called the familywise error rate (FWER). For a given family, control of
the FWER at (joint) level $\alpha$ requires that FWER $\le \alpha$ for all possible
distributions of the data considered in the model.

Of course, safeguards against false rejections are not the only concern of 
multiple testing procedures. Corresponding to the power of a single test, one 
must also consider the ability of a procedure to detect departures from the 
null hypotheses. When the number of tests, s, is large, such as in genomics 
studies, control of the FWER at conventional levels becomes so stringent 
that individual departures from the null hypotheses have little chance of 
being detected. For this reason, we shall consider alternatives to the FWER 
that control false rejections less severely in hopes of better power. 

First, we shall consider the k-FWER, the probability of rejecting at least
k true null hypotheses. More formally, suppose data X is available from
some model $P \in \Omega$. A general hypothesis H can be viewed as a subset $\omega$ of
$\Omega$. For testing $H_i : P \in \omega_i$, $i = 1, \ldots, s$, let $I(P)$ denote the set of true null
hypotheses when P is the true probability distribution; that is, $i \in I(P)$ if
and only if $P \in \omega_i$. Then, the k-FWER, which depends on P, is defined to
be

(1) $k\text{-FWER}_P = P\{\text{reject at least } k \text{ hypotheses } H_i : i \in I(P)\}$.

Control of the k-FWER requires that k-FWER $\le \alpha$ for all P; that is,

(2) $k\text{-FWER}_P \le \alpha$ for all $P$.

Evidently, the case k = 1 reduces to control of the usual FWER.

We will also consider control of the false discovery proportion (FDP),
defined as the total number of false rejections divided by the total number
of rejections (and equal to 0 if there are no rejections). Given a user-specified
value $\gamma \in [0, 1)$, the error measure we wish to control is
$P\{\mathrm{FDP} > \gamma\}$; thus, we wish to construct methods satisfying

(3) $P\{\mathrm{FDP} > \gamma\} \le \alpha$ for all $P$.

We will derive methods where this holds (at least asymptotically). Evidently,
control of the FDP with $\gamma = 0$ reduces to the usual FWER. Control of the
false discovery rate (FDR) requires that $E(\mathrm{FDP}) \le \gamma$.
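
To make the two error notions concrete, the following minimal sketch counts false rejections and computes the FDP for one realized set of decisions. The function name and the boolean-array encoding of decisions are assumptions made for illustration; in practice the set of true nulls $I(P)$ is unknown, so this is only usable in simulations.

```python
import numpy as np

def false_discovery_proportion(rejected, true_null):
    """Return (number of false rejections, FDP) for one set of decisions.

    rejected[i]  is True if H_i was rejected.
    true_null[i] is True if H_i is a true null, i.e. i is in I(P).
    The FDP is defined to be 0 when nothing is rejected.
    """
    rejected = np.asarray(rejected, dtype=bool)
    true_null = np.asarray(true_null, dtype=bool)
    n_rejected = int(rejected.sum())
    n_false = int((rejected & true_null).sum())
    fdp = n_false / n_rejected if n_rejected > 0 else 0.0
    return n_false, fdp

# Illustrative decisions for s = 5 hypotheses (made-up values).
rejected = [True, True, True, False, True]
true_null = [False, True, False, True, True]
n_false, fdp = false_discovery_proportion(rejected, true_null)
# A k-FWER violation occurs in this realization iff n_false >= k;
# an FDP violation occurs iff fdp > gamma.
```

Averaging the indicators `n_false >= k` and `fdp > gamma` over simulated data sets gives Monte Carlo estimates of the k-FWER and of $P\{\mathrm{FDP} > \gamma\}$.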

Recently there have been a number of methods that control generalized 
error rates which are less stringent than the FWER. A prominent such 
technique is the FDR controlling method of [1]. Additional methods that 
control the FDR are given in [2] and [30]. Genovese and Wasserman [10] 
study asymptotic procedures that control the FDP (and the FDR) in the 
framework of a random effects mixture model. These ideas are extended in 
[20], where in the context of random fields, the number of null hypotheses 
is uncountable. Korn et al. [15] provide methods that control both the k- 
FWER and FDP; they provide some justification for their methods, but 
they are limited to a multivariate permutation model. Alternative methods
of control of the k-FWER and FDP are given in [34]; they include both finite
sample and asymptotic results. Like our work, their approach implicitly
accounts for the dependence between the tests with the goal of improved
ability to detect false hypotheses; comparisons between the methods will be
made in Section 5. Building upon work for control of the FWER in [9, 22]
and [28], we employ resampling to achieve our goals, which does not require
the use of the subset pivotality condition of [35]. A further key ingredient
is the use of the so-called k-max statistic, initially suggested in [9] in the
construction of a single-step procedure. Our procedures here can be seen
as step-down improvements over such single-step methods. A further new
method is given in [33].

Some methods that control the k-FWER and FDP are now briefly reviewed.
Suppose that p-values $p_1, \ldots, p_s$ are available for testing $H_1, \ldots, H_s$.
For $p_i$ to be a p-value, it is required that, for all $u \in [0, 1]$ and all $P \in \omega_i$,
$P\{p_i \le u\} \le u$. Then, for any fixed k, the procedure that rejects $H_i$ if
$p_i \le k\alpha/s$ controls the k-FWER at level $\alpha$, and can be viewed as a
generalization of the Bonferroni procedure which uses k = 1; see [17]. It is an
example of a single-step procedure, meaning any null hypothesis is rejected
if its corresponding p-value is less than or equal to a common cutoff value.

Improvements are possible by considering a class of step-down procedures,
which we now describe. Order the p-values by $p_{(1)} \le p_{(2)} \le \cdots \le p_{(s)}$, and let
$H_{(1)}, \ldots, H_{(s)}$ denote the corresponding hypotheses. Let

(4) $\alpha_1 \le \alpha_2 \le \cdots \le \alpha_s$

be constants. If $p_{(1)} > \alpha_1$, reject no null hypotheses. Otherwise, if

(5) $p_{(1)} \le \alpha_1, \ldots, p_{(r)} \le \alpha_r$,

reject hypotheses $H_{(1)}, \ldots, H_{(r)}$, where the largest r satisfying (5) is used.
The procedure of [13] uses $\alpha_j = \alpha/(s - j + 1)$ and controls the FWER at
level $\alpha$. For general k, consider the following generalized Holm step-down
procedure described in (5), where now we specifically set

(6) $\alpha_j = \begin{cases} \dfrac{k\alpha}{s}, & j \le k, \\[1ex] \dfrac{k\alpha}{s + k - j}, & j > k. \end{cases}$

Of course, the $\alpha_j$ depend on s and k, but we suppress this dependence in
the notation. Then the step-down method described in (5) with $\alpha_j$ given by
(6) controls the k-FWER; that is, (2) holds; see [14] and [17].
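
The single-step rule $p_i \le k\alpha/s$ and the step-down rule (5) with the constants (6) can be sketched in a few lines. This is an illustrative implementation, not code from the paper; the function names and the example p-values are assumptions.

```python
import numpy as np

def generalized_bonferroni_reject(pvalues, k, alpha):
    """Single-step rule: reject H_i when p_i <= k * alpha / s."""
    p = np.asarray(pvalues, dtype=float)
    return p <= k * alpha / p.size

def generalized_holm_constants(s, k, alpha):
    """Constants (6): k*alpha/s for j <= k, k*alpha/(s + k - j) for j > k."""
    j = np.arange(1, s + 1)
    return np.where(j <= k, k * alpha / s, k * alpha / (s + k - j))

def step_down_reject(pvalues, constants):
    """Step-down rule (5): reject H_(1..r) for the largest r such that
    p_(1) <= a_1, ..., p_(r) <= a_r all hold."""
    p = np.asarray(pvalues, dtype=float)
    order = np.argsort(p, kind="stable")
    passed = p[order] <= np.asarray(constants, dtype=float)
    # r = number of leading True entries in `passed`
    r = p.size if passed.all() else int(np.argmin(passed))
    reject = np.zeros(p.size, dtype=bool)
    reject[order[:r]] = True
    return reject

# Illustrative p-values (made up), s = 5, alpha = 0.05.
pvals = [0.001, 0.012, 0.022, 0.03, 0.6]
single = generalized_bonferroni_reject(pvals, k=2, alpha=0.05)
holm_k2 = step_down_reject(pvals, generalized_holm_constants(5, 2, 0.05))
holm_k1 = step_down_reject(pvals, generalized_holm_constants(5, 1, 0.05))
```

On these values the single-step rule with k = 2 rejects two hypotheses, the generalized Holm procedure with k = 2 rejects four, and the ordinary Holm procedure (k = 1) rejects two, which illustrates both the gain from stepping down and the gain from tolerating k - 1 false rejections.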

Turning to FDP control, [17] reason as follows. To develop a step-down
procedure satisfying (3), let F denote the number of false rejections. At step
j, having rejected $j - 1$ hypotheses, we want to guarantee $F/j \le \gamma$, that is,
$F \le \lfloor \gamma j \rfloor$, where $\lfloor x \rfloor$ is the greatest integer $\le x$. So, if $k = \lfloor \gamma j \rfloor + 1$, then
$F \ge k$ should have probability no greater than $\alpha$; that is, we must control
the number of false rejections to be $< k$. Therefore, we use the step-down
constant $\alpha_j$ with this choice of k (which now depends on j); that is,

(7) $\alpha_j = \dfrac{(\lfloor \gamma j \rfloor + 1)\,\alpha}{s + \lfloor \gamma j \rfloor + 1 - j}$.

Under certain dependence assumptions on the p-values, this method satisfies
(3). Some more conservative methods that hold under no dependence
assumptions are also developed in [17, 25] and [26]. Typically, these generalized
Holm type of methods assume a least favorable joint distribution for the
p-values. In contrast, here we implicitly try to estimate the joint distribution
of p-values with the hope of greater ability to detect false hypotheses.
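
As a worked illustration of the reasoning above, substituting $k = \lfloor \gamma j \rfloor + 1$ into the $j > k$ branch of (6) gives the constant $(\lfloor \gamma j \rfloor + 1)\alpha / (s + \lfloor \gamma j \rfloor + 1 - j)$; since $\gamma < 1$ forces $k \le j$, the case $j = k$ gives the same value $k\alpha/s$ as the other branch. The sketch below computes these constants. The function name is an assumption, and `math.floor(gamma * j)` is subject to floating-point rounding when $\gamma j$ is very close to an integer.

```python
import math

def fdp_step_down_constants(s, gamma, alpha):
    """Step-down constants obtained from (6) with k = floor(gamma*j) + 1.

    Returns [alpha_1, ..., alpha_s]. With gamma = 0 this reduces to the
    Holm constants alpha / (s - j + 1).
    """
    constants = []
    for j in range(1, s + 1):
        k_j = math.floor(gamma * j) + 1  # tolerated false rejections + 1
        constants.append(k_j * alpha / (s + k_j - j))
    return constants

# Illustrative values: s = 10 hypotheses, gamma = 0.1, alpha = 0.05.
fdp_consts = fdp_step_down_constants(10, 0.1, 0.05)
holm_consts = fdp_step_down_constants(10, 0.0, 0.05)
```

These constants can be passed to any implementation of the step-down rule (5); they are never smaller than the Holm constants, so the resulting procedure rejects at least as many hypotheses.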

In general, we suppose that rejection of $H_i$ is based on large values of
a test statistic $T_{n,i}$ (with the subscript n used for asymptotic purposes). If
a p-value $p_i$ is available for testing $H_i$, one can take $T_{n,i} = -p_i$. Then we
restrict attention to tests that reject an intersection hypothesis $H_K$ when
the kth largest of the test statistics $\{T_{n,i} : i \in K\}$ is large. In some problems,
[19] show that such stepwise procedures are optimal in a certain sense, in the
case k = 1. Here, our primary goal is to show how computationally feasible
step-down procedures can be constructed quite generally that control the
k-FWER and FDP under minimal conditions.

In Section 2 we show that, if we estimate critical values that have a
monotonicity property, then the basic problem of constructing a valid multiple
test procedure that controls the k-FWER can essentially be reduced to the
problem of sequentially constructing critical values for (at most order s)
single tests that control the usual Type 1 error. In particular, if finite sample
methods which offer control of the Type 1 error are available for each of
the individual tests, then this will immediately translate into control of the
k-FWER. Otherwise, we can apply bootstrap and subsampling methods to
achieve asymptotic control, as described in Section 3. Results for control
of the FDP are obtained in Section 4. Comparisons with the augmentation
procedures of [34] are discussed in Section 5. In Section 6 we present a
simulation study to examine the finite sample performance of various methods.
The simulations demonstrate that our methods outperform or are at least
competitive with currently available methods. All proofs are collected in the
Appendix.

2. Basic results for control of the k-FWER. Suppose data X is generated
from some unknown probability distribution P. In anticipation of
asymptotic results, we may write $X = X^{(n)}$, where n typically refers to the
sample size. A model assumes that P belongs to a certain family of probability
distributions $\Omega$, though we make no rigid requirements for $\Omega$; it may
be a parametric, semiparametric or a nonparametric model.

Consider the problem of simultaneously testing a hypothesis $H_i$ against
$H_i'$, for $i = 1, \ldots, s$. Of course, a hypothesis $H_i$ can be viewed as a subset
$\omega_i$ of $\Omega$, in which case the hypothesis $H_i$ is equivalent to $P \in \omega_i$ and $H_i'$ is
equivalent to $P \notin \omega_i$. For any subset $K \subseteq \{1, \ldots, s\}$, define $H_K = \bigcap_{i \in K} H_i$
to be the intersection hypothesis that $P \in \bigcap_{i \in K} \omega_i$. We also assume a test
of the individual hypothesis $H_i$ is based on a test statistic $T_{n,i}$, with large
values indicating evidence against $H_i$.

Some further notation is required. Suppose $\{y_i : i \in K\}$ is a collection
of real numbers indexed by a finite set K having $|K|$ elements. Then, for
$k \le |K|$, $k\text{-max}(y_i : i \in K)$ is used to denote the kth largest value of the $y_i$
with $i \in K$. So, if the elements $y_i$, $i \in K$, are ordered as $y_{(1)} \le \cdots \le y_{(|K|)}$,
then $k\text{-max}(y_i : i \in K) = y_{(|K|-k+1)}$.
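
The k-max operation is simply an order statistic counted from the top. A minimal sketch (the function name `k_max` is an assumption for illustration):

```python
import numpy as np

def k_max(values, k):
    """Return the kth largest element of `values`, i.e. y_(|K|-k+1)
    when the values are sorted in increasing order."""
    y = np.sort(np.asarray(values, dtype=float))
    if not 1 <= k <= y.size:
        raise ValueError("k must satisfy 1 <= k <= number of values")
    return float(y[y.size - k])

largest = k_max([3, 1, 4, 1, 5], 1)         # ordinary maximum
second_largest = k_max([3, 1, 4, 1, 5], 2)  # 2-max
```

With k = 1 this is the usual maximum, which is why the k = 1 case of the procedures below reduces to the familiar max-statistic approach to FWER control.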

2.1. Single-step control of the k-FWER. Throughout this section, k is
fixed. First, we briefly discuss a single-step approach to control the k-FWER,
since it serves as a building block for the more powerful step-down procedures
considered later. For any subset $K \subseteq \{1, \ldots, s\}$, let $c_{n,K}(\alpha, k, P)$ denote
an $\alpha$-quantile of the distribution of $k\text{-max}(T_{n,i} : i \in K)$ under P. Concretely,

(8) $c_{n,K}(\alpha, k, P) = \inf\{x : P\{k\text{-max}(T_{n,i} : i \in K) \le x\} \ge \alpha\}$.

(We use the subscript n for asymptotic purposes, though the priority in
this section is to study nonasymptotic results.) For testing the intersection
hypothesis $H_K$ with $K \subseteq \{1, \ldots, s\}$, it is only required to approximate a
critical value for $P \in \bigcap_{i \in K} \omega_i$. Because there may be many such P, we define

(9) $c_{n,K}(\alpha, k) = \sup\{c_{n,K}(\alpha, k, P) : P \in \bigcap_{i \in K} \omega_i\}$.

[In order to define $c_{n,K}(\alpha, k)$, we implicitly assume $\bigcap_{i \in K} \omega_i$ is not empty.]
Consider the idealized test that rejects any $H_i$ for which
$T_{n,i} > c_{n,I(P)}(1 - \alpha, k, P)$. This is a single-step method in that each $T_{n,i}$ is
compared with a common cutoff. However, this is an idealization because the
critical value $c_{n,I(P)}(1 - \alpha, k, P)$ is in general unknown. Such a fictional test
clearly controls the k-FWER at level $\alpha$. Indeed, if $|I(P)| < k$, then there is
nothing to prove; otherwise,

$P\{k \text{ or more false rejections}\} = P\{k\text{-max}(T_{n,i} : i \in I(P)) > c_{n,I(P)}(1 - \alpha, k, P)\} \le \alpha$,

with equality if the distribution of $k\text{-max}(T_{n,i} : i \in I(P))$ is continuous
under P. Unfortunately, the test is unavailable as the critical value is in
general unknown. One possible approach is to replace $c_{n,I(P)}(1 - \alpha, k, P)$ by
$c_{n,I(P)}(1 - \alpha, k)$, but this still depends on P through $I(P)$. Since $I(P)$ is
unknown, a conservative approach would be to assume all hypotheses are
true and replace $c_{n,I(P)}(1 - \alpha, k)$ by $c_{n,A}(1 - \alpha, k)$, where $A = \{1, \ldots, s\}$.

Unfortunately, in nonparametric problems, the sup in (9) may be formidable
or impossible to calculate, and may be far too conservative anyway. Instead,
another possibility is to replace the critical value $c_{n,I(P)}(1 - \alpha, k, P)$ by some
estimate $\hat c_{n,I(P)}(1 - \alpha, k)$, which is at least consistent or conservative. In
general, suppose $\hat c_{n,K}(1 - \alpha, k)$ represents an approximation or estimate of the
$1 - \alpha$ quantile of the distribution of $k\text{-max}(T_{n,i} : i \in K)$, at least valid when
$H_i$ is true for $i \in K$. Bootstrap and subsampling methods offer viable general
approaches, and will be used later. Such a single-step approach using the
k-max statistic was also discussed in [9]. (Rather than formalizing the required
conditions for consistency right now, we will later give explicit conditions
for more powerful step-down methods.) A single-step approach would then
be to replace K by $A = \{1, \ldots, s\}$.

Example 2.1 (Multivariate normal mean). Suppose $(X_1, \ldots, X_s)$ is
multivariate normal with unknown mean $\mu = (\mu_1, \ldots, \mu_s)$ and known covariance
matrix $\Sigma$ having $(i, j)$ component $\sigma_{i,j}$. Consider testing $H_i : \mu_i \le 0$ versus
$\mu_i > 0$. Let $T_{n,i} = X_i / \sqrt{\sigma_{i,i}}$, since the test that rejects for large $X_i / \sqrt{\sigma_{i,i}}$ is
UMP for testing $H_i$. For $|K| \ge k$, $c_{n,K}(1 - \alpha, k)$ is the $1 - \alpha$ quantile of
the distribution of $k\text{-max}(T_{n,i} : i \in K)$ when $\mu = 0$. A single-step approach
would reject any $H_i$ for which $T_{n,i} > c_{n,A}(1 - \alpha, k)$, where $A = \{1, \ldots, s\}$.
Since $c_{n,A}(1 - \alpha, k) \ge c_{n,I(P)}(1 - \alpha, k) \ge c_{n,I(P)}(1 - \alpha, k, P)$, this procedure
clearly controls the k-FWER.
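
In the multivariate normal example the quantile of the k-max statistic under $\mu = 0$ generally has no closed form, but it can be approximated by simulation. The sketch below is a Monte Carlo approximation, not the exact critical value; the function names, the number of draws and the equicorrelated covariance matrix are assumptions chosen for illustration.

```python
import numpy as np

def k_max_rows(t, k):
    """kth largest entry of each row of a 2-D array."""
    return np.sort(t, axis=1)[:, -k]

def mc_critical_value(sigma, subset, k, alpha, n_draws=20000, seed=0):
    """Monte Carlo approximation to the 1 - alpha quantile of
    k-max(X_i / sqrt(sigma_ii) : i in subset) when mu = 0."""
    rng = np.random.default_rng(seed)
    sigma = np.asarray(sigma, dtype=float)
    s = sigma.shape[0]
    x = rng.multivariate_normal(np.zeros(s), sigma, size=n_draws)
    t = x / np.sqrt(np.diag(sigma))      # standardized statistics T_{n,i}
    return float(np.quantile(k_max_rows(t[:, list(subset)], k), 1 - alpha))

# Illustrative covariance: s = 4, unit variances, correlation 0.5.
sigma = 0.5 * np.ones((4, 4)) + 0.5 * np.eye(4)
c_all_k2 = mc_critical_value(sigma, range(4), k=2, alpha=0.05)
c_sub_k2 = mc_critical_value(sigma, [0, 1], k=2, alpha=0.05)
c_all_k1 = mc_critical_value(sigma, range(4), k=1, alpha=0.05)
c_single = mc_critical_value(sigma, [0], k=1, alpha=0.05)
```

Because the same seed is reused, every call is based on the same simulated draws, so the approximated critical value for a larger index set is never below that for a subset, and the value for k = 2 is never above the value for k = 1.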

More generally, suppose $H_i$ specifies $\{P : \theta_i(P) \le 0\}$ for some real-valued
parameter $\theta_i$. Let $\hat\theta_{n,i}$ be an estimate of $\theta_i(P)$. Also, let $T_{n,i} = \tau_n \hat\theta_{n,i}$ for some
nonnegative (nonrandom) sequence $\tau_n \to \infty$. The sequence $\tau_n$ is introduced
for later asymptotic purposes so that a limiting distribution for
$\tau_n[\hat\theta_{n,i} - \theta_i(P)]$ exists. In typical situations, $\tau_n = n^{1/2}$.

For $K \subseteq \{1, \ldots, s\}$ with $|K| \ge k$, let $L_{n,K}(k, P)$ denote the distribution
under P of $k\text{-max}(\tau_n[\hat\theta_{n,i} - \theta_i(P)] : i \in K)$, with corresponding cumulative
distribution function $L_{n,K}(x, k, P)$ and $\alpha$-quantile
$b_{n,K}(\alpha, k, P) = \inf\{x : L_{n,K}(x, k, P) \ge \alpha\}$. By the definition of these quantiles
and using k = 1,

(10) $\{(\theta_i : i \in A) : \max_{i \in A} \tau_n[\hat\theta_{n,i} - \theta_i] \le b_{n,A}(1 - \alpha, 1, P)\}$

is an exact $1 - \alpha$ level joint confidence region for the subset of parameters
$\{\theta_i(P) : i \in A\}$. That is, the probability that the entire subset $\{\theta_i(P) : i \in A\}$
will be contained in (10) is greater than or equal to $1 - \alpha$. By allowing $k > 1$,
we can construct a "generalized" joint confidence region. More precisely,
the probability that at least $|A| - k + 1$ elements of $\{\theta_i(P) : i \in A\}$ will be
contained in

(11) $\{(\theta_i : i \in A) : k\text{-max}_{i \in A}\, \tau_n[\hat\theta_{n,i} - \theta_i] \le b_{n,A}(1 - \alpha, k, P)\}$

is greater than or equal to $1 - \alpha$. In other words, the probability that k or
more elements of $\{\theta_i(P) : i \in A\}$ will fall outside (11) is less than or equal to
$\alpha$.

A value of 0 for $\theta_i(P)$ falls outside the region (11) if and only if
$\tau_n \hat\theta_{n,i} > b_{n,A}(1 - \alpha, k, P)$. By the usual duality of confidence sets and
hypothesis tests, this suggests the use of the critical value

(12) $\hat c_{n,A}(1 - \alpha, k) = b_{n,A}(1 - \alpha, k, P)$

to control the k-FWER. The problem is that the critical value (12) is not
feasible, since P is unknown. Section 3 describes how approximate but feasible
critical values can be obtained by the use of resampling methods. For example,
the bootstrap replaces P by an estimated distribution $\hat Q_n$, resulting in
the critical value $\hat c_{n,A}(1 - \alpha, k) = b_{n,A}(1 - \alpha, k, \hat Q_n)$.



2.2. Step-down methods that control the k-FWER. Let

(13) $T_{n,r_1} \ge T_{n,r_2} \ge \cdots \ge T_{n,r_s}$

denote the observed ordered test statistics, and let $H_{r_1}, H_{r_2}, \ldots, H_{r_s}$ be
the corresponding hypotheses. Step-down methods begin by first applying
a single-step method, but then additional hypotheses may be rejected after
this first stage by proceeding in a stepwise fashion, which we now describe.
Begin by testing the joint null (intersection) hypothesis $H_{\{1,\ldots,s\}}$ that all
hypotheses are true. This hypothesis is rejected if $T_{n,r_1}$ is deemed large,
in which case $H_{r_1}$ is rejected. Here, the meaning of large is determined by
some critical value $\hat c_{n,A}(1 - \alpha, k)$, which is designed to offer single-step
control when testing the intersection hypothesis $H_A$ with $A = \{1, \ldots, s\}$. If it
is not large, accept all hypotheses; otherwise, reject the hypothesis
corresponding to the largest test statistic. Once a hypothesis is rejected, the next
most significant hypothesis corresponding to the next largest test statistic
is considered, and so on. At any stage, one tests appropriate intersection
hypotheses $H_K$. Suppose that critical constants $\hat c_{n,K}(1 - \alpha, k)$ are available
from our statistical tool chest, which we might contemplate for use as a
single-step procedure for testing $H_K$. The critical constants $\hat c_{n,K}(1 - \alpha, k)$
may be fixed or random, but the reader should have in mind that they each
could be used as a test of $H_K$.

Algorithm 2.1 (Generic step-down method for control of the k-FWER).

1. Let $A_1 = \{1, \ldots, s\}$. If $\max(T_{n,i} : i \in A_1) \le \hat c_{n,A_1}(1 - \alpha, k)$, then accept all
hypotheses and stop; otherwise, reject any $H_i$ for which
$T_{n,i} > \hat c_{n,A_1}(1 - \alpha, k)$ and continue.

2. Let $R_2$ be the indices i of hypotheses $H_i$ previously rejected, and let
$A_2$ be the indices of the remaining hypotheses. If $|R_2| < k$, then stop.
Otherwise, let

$\hat d_{n,A_2}(1 - \alpha, k) = \max_{I \subseteq R_2,\, |I| = k-1} \{\hat c_{n,K}(1 - \alpha, k) : K = A_2 \cup I\}$.

Then, reject any $H_i$ with $i \in A_2$ satisfying $T_{n,i} > \hat d_{n,A_2}(1 - \alpha, k)$. If there
are no further rejections, stop.

j. Let $R_j$ be the indices i of hypotheses $H_i$ previously rejected, and let $A_j$
be the indices of the remaining hypotheses. Let

$\hat d_{n,A_j}(1 - \alpha, k) = \max_{I \subseteq R_j,\, |I| = k-1} \{\hat c_{n,K}(1 - \alpha, k) : K = A_j \cup I\}$.

Then, reject any $H_i$ with $i \in A_j$ satisfying $T_{n,i} > \hat d_{n,A_j}(1 - \alpha, k)$. If there
are no further rejections, stop.



And so on. 

Note that, in the case k = 1, once a hypothesis is removed, it no longer
enters into the algorithm. However, for k > 1, the algorithm becomes slightly
more complex. The reason is that, for control of the k-FWER, we must
acknowledge that when we consider a set of hypotheses not previously rejected,
we may have gotten to that stage by rejecting true null hypotheses, but
hopefully at most $k - 1$ of them. Since we do not know which of the hypotheses
rejected thus far are true or false, we must maximize over subsets including
some of those rejected, but at most $k - 1$ among the previously rejected
ones. Our main point will be that, if we can control the k-FWER at any
stage of the algorithm, then the step-down test will control the k-FWER.
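
The generic step-down method can be sketched as follows. The critical-value function `crit` is supplied by the user and plays the role of the single-test critical values for an index set K; the names `generic_step_down` and `make_pvalue_crit` are assumptions for illustration. The example uses $T_{n,i} = -p_i$ with critical value $-k\alpha/|K|$, which is one monotone choice; ties aside, it reproduces the generalized Holm procedure.

```python
from itertools import combinations

import numpy as np

def generic_step_down(t, crit, k):
    """Generic step-down method. `t` holds the test statistics and
    crit(K) returns the critical value for the index set K (a frozenset).
    Returns the set of rejected indices."""
    t = np.asarray(t, dtype=float)
    remaining = set(range(t.size))
    # Step 1: single-step test based on all hypotheses.
    c = crit(frozenset(remaining))
    rejected = {i for i in remaining if t[i] > c}
    remaining -= rejected
    # Later steps: maximize over all (k-1)-subsets I of earlier rejections.
    while rejected and remaining and len(rejected) >= k:
        d = max(
            crit(frozenset(remaining | set(subset)))
            for subset in combinations(sorted(rejected), k - 1)
        )
        new = {i for i in remaining if t[i] > d}
        if not new:
            break
        rejected |= new
        remaining -= new
    return rejected

def make_pvalue_crit(k, alpha):
    """Critical values for T_i = -p_i: reject when p_i < k*alpha/|K|."""
    return lambda K: -k * alpha / len(K)

# Illustrative p-values (made up).
pvals = np.array([0.001, 0.012, 0.022, 0.03, 0.6])
rej_k2 = generic_step_down(-pvals, make_pvalue_crit(2, 0.05), k=2)
rej_k1 = generic_step_down(-pvals, make_pvalue_crit(1, 0.05), k=1)
```

The maximization runs over all subsets of size $k - 1$ of the rejected set, so its cost grows combinatorially with the number of rejections; this is feasible here only because the example is small.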

Remark 2.1 (Modified generic step-down method for control of the
k-FWER). One can modify the above algorithm or any method that
controls the k-FWER as follows. If the method rejects at least $k - 1$
hypotheses, no modification is applied; otherwise, reject the $k - 1$ most significant
hypotheses. This would not change control of the k-FWER. However, we
do not generally promote this modification, because hypotheses can be
rejected without compelling evidence (i.e., even if they have large unadjusted
p-values).

In order to prove such an algorithm controls the k-FWER for suitable
choice of critical values $\hat c_{n,K}(1 - \alpha, k)$, we assume monotonicity of the
estimated critical values; that is, for any $K \supseteq I(P)$,

(14) $\hat c_{n,K}(1 - \alpha, k) \ge \hat c_{n,I(P)}(1 - \alpha, k)$.

Ideally, we would also like the following to hold: if $\hat c_{n,K}(1 - \alpha, k)$ is used
to test the intersection hypothesis $H_K$, then the chance of k or more false
rejections is bounded above by $\alpha$ when $K = I(P)$; that is,

(15) $P\{k\text{-max}(T_{n,i} : i \in I(P)) > \hat c_{n,I(P)}(1 - \alpha, k)\} \le \alpha$.

Under the monotonicity assumption (14), we will show the basic inequality
that $k\text{-FWER}_P$ is bounded above by the left-hand side of (15). This will then
show that, if we can construct monotone critical values such that each
intersection test controls the k-FWER, then the step-down procedure controls
the k-FWER. Thus, the construction of a step-down procedure is effectively
reduced to construction of single tests, as long as the monotonicity
assumption holds (and it always does for specific choices studied later).

Theorem 2.1. Let P denote the true distribution generating the data.
Consider Algorithm 2.1 with critical values $\hat c_{n,K}(1 - \alpha, k)$ satisfying (14).

(i) Then

(16) $k\text{-FWER}_P \le P\{k\text{-max}(T_{n,i} : i \in I(P)) > \hat c_{n,I(P)}(1 - \alpha, k)\}$.

(ii) Therefore, if the critical values also satisfy (15), then
$k\text{-FWER}_P \le \alpha$.

The monotonicity assumption (14) cannot be removed, as shown in
Example 2.1 of [28] in the case k = 1; an analogous construction works for general
k. The general resampling constructions we describe later will inherently
satisfy (14).

As a corollary, consider the nonrandom choice of critical values
$\hat c_{n,K}(1 - \alpha, k) = c_{n,K}(1 - \alpha, k)$ defined in (9). Assume the following
monotonicity assumption: for $K \supseteq I(P)$,

(17) $c_{n,K}(1 - \alpha, k) \ge c_{n,I(P)}(1 - \alpha, k)$.

The condition (17) can be expected to hold in many situations because the
left-hand side is based on computing the $1 - \alpha$ quantile of the kth largest of
$|K|$ variables, while the right-hand side is based on the kth largest of
$|I(P)| \le |K|$ variables (though one must be careful and realize that the quantiles
are computed under possibly different P, which is why some condition is
required).

Corollary 2.1. Let P denote the true distribution generating the data.
Assume $\bigcap_{i=1}^{s} \omega_i$ is not empty.

(i) Consider Algorithm 2.1 with $\hat c_{n,K}(1 - \alpha, k) = c_{n,K}(1 - \alpha, k)$ and
assume (17). Then $k\text{-FWER}_P \le \alpha$.

(ii) Control persists if in Algorithm 2.1 the critical constants
$\hat c_{n,K}(1 - \alpha, k)$ are replaced by $d_{n,K}(1 - \alpha, k)$ which satisfy
$d_{n,K}(1 - \alpha, k) \ge c_{n,K}(1 - \alpha, k)$.

(iii) Moreover, the condition (17) may then be removed if the
$d_{n,K}(1 - \alpha, k)$ satisfy $d_{n,K}(1 - \alpha, k) \ge d_{n,I(P)}(1 - \alpha, k)$ for any $K \supseteq I(P)$.

Example 2.2 (Multivariate normal mean, continuation of Example 2.1).
Recall the setup of Example 2.1 with $T_{n,i} = X_i / \sqrt{\sigma_{i,i}}$. To apply Corollary 2.1,
assume that $|I(P)| \ge k$ or there is nothing to prove. Let $c_{n,K}(1 - \alpha, k)$ be the
$1 - \alpha$ quantile of the distribution of $k\text{-max}(T_{n,i} : i \in K)$ when $\mu = 0$. Since
$k\text{-max}(T_{n,i} : i \in I) \le k\text{-max}(T_{n,i} : i \in K)$ whenever $I \subseteq K$, (17) is satisfied.
Moreover, the resulting procedure rejects at least as many hypotheses as
the generalized Holm procedure, as it accounts for the dependence of the
test statistics.

The previous example is parametric in nature. However, we will see that a
valid step-down approach can apply to nonparametric problems. Our main
goal will be to apply resampling methods that can account for the
dependence structure of the test statistics. We also observe that Theorem 2.1
applies to certain semiparametric problems where permutation and
randomization tests apply. This was accomplished in the case k = 1 by [28], but the
argument generalizes given Theorem 2.1. In fact, the result in [15] for
k-FWER control in a specialized multivariate permutation setup is a special
case of our results.

However, we first observe the fact that the generalized Holm procedure
described by (5) with critical values given by (6) controls the k-FWER. This
follows from Theorem 2.1 and the fact that, when testing $|K|$ hypotheses,
the single-step procedure that rejects any $H_i$ whose corresponding p-value is
$\le k\alpha/|K|$ controls the k-FWER; see Theorem 2.1(i) of [17]. Note the critical
values $k\alpha/|K|$ are monotone in $|K|$.

Outside some parametric models, application of the generic step-down 
method can be computationally intensive, so we will also consider the fol- 
lowing more streamlined algorithm. The basic idea is that at any stage, when 
testing whether or not to include further rejections, we need only look at 
the hypotheses not previously rejected together with the k — 1 hypotheses 
that are least significant among those previously rejected. So, we avoid max- 
imizing over all subsets of size k — 1 of previously rejected hypotheses and 
just look at the most "recent" k — 1 rejections. The arguments for such a 
procedure will be asymptotic. 

Algorithm 2.2 (Streamlined step-down method for control of the
k-FWER). The algorithm is analogous to Algorithm 2.1. The only difference
is that in any step $j > 1$ the critical value

$\hat d_{n,A_j}(1 - \alpha, k) = \max_{I \subseteq R_j,\, |I| = k-1} \{\hat c_{n,K}(1 - \alpha, k) : K = A_j \cup I\}$

is replaced by the critical value

$\hat d_{n,A_j}(1 - \alpha, k) = \hat c_{n,K}(1 - \alpha, k)$,

where $K = \{r_{|R_j|-k+2}, r_{|R_j|-k+3}, \ldots, r_s\}$.
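
The streamlined variant replaces the maximization over subsets by a single critical value computed from the not-yet-rejected hypotheses together with the $k - 1$ least significant earlier rejections. A minimal sketch, with the function names and the p-value based critical values again being assumptions for illustration:

```python
import numpy as np

def streamlined_step_down(t, crit, k):
    """Streamlined step-down method. crit(K) returns the critical value
    for an index set K (a frozenset). Returns the set of rejected indices."""
    t = np.asarray(t, dtype=float)
    s = t.size
    order = np.argsort(-t, kind="stable")   # r_1, ..., r_s (largest first)
    # Step 1: single-step test based on all hypotheses.
    c = crit(frozenset(order.tolist()))
    n_rej = int(np.sum(t > c))
    while 0 < n_rej < s and n_rej >= k:
        # K = remaining hypotheses plus the k-1 most recent rejections,
        # i.e. r_{|R|-k+2}, ..., r_s in the 1-based notation of the text.
        K = order[n_rej - (k - 1):]
        d = crit(frozenset(K.tolist()))
        new = int(np.sum(t[order[n_rej:]] > d))
        if new == 0:
            break
        n_rej += new
    return set(order[:n_rej].tolist())

def make_pvalue_crit(k, alpha):
    """Critical values for T_i = -p_i: reject when p_i < k*alpha/|K|."""
    return lambda K: -k * alpha / len(K)

# Illustrative p-values (made up).
pvals = np.array([0.001, 0.012, 0.022, 0.03, 0.6])
rej_stream = streamlined_step_down(-pvals, make_pvalue_crit(2, 0.05), k=2)
```

Each step now needs only one critical value. With critical values that depend on K only through $|K|$, as in this example, the streamlined and generic methods coincide; in general they need not, which is consistent with the justification for the streamlined version being asymptotic.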

3. Asymptotic results on k-FWER control. The main goal of this section
is to show how Theorem 2.1 can be used to construct step-down procedures
that asymptotically control the k-FWER under very weak assumptions. The
use of resampling techniques will be a key ingredient. The methods
constructed will be based on Algorithm 2.1, and so potentially many tests are
constructed in a stepwise fashion. However, a key feature is that the
methods will only require one set of resamples for all of the tests, whether they
are bootstrap samples or subsamples.

In order to accomplish this, we will consider resampling schemes that 
do not obey the null hypothesis constraints. Such schemes have been sug- 
gested previously by [9] and [22], and have the benefit of avoiding the subset 
pivotality condition of [35]. Hypothesis test constructions that do obey the 
constraints imposed by the null hypothesis, as discussed in [4] and [24], are 
based on the idea that the critical value should be obtained under the null
hypothesis and so the resampling scheme should reflect the constraints of 
the null hypothesis. This idea is even advocated as a principle in [12], and 
it is enforced throughout [35]. While appealing, it is by no means the only 
approach toward inference in hypothesis testing. In some problems, the sub- 
set pivotality condition of [35] holds, and so the same null distribution can 
be used at each step. However, this condition does not hold in general; for 
instance, see Example 4.1 of [28]. To obtain a more general construction, we 
exploit the well-known explicit duality between tests and confidence inter- 
vals; so, if one can construct good or valid confidence intervals, then one can 
construct good or valid tests, and conversely. The same holds for simultane- 
ous confidence sets and multiple tests. 

We shall consider two concrete applications of Theorem 2.1, the first based
on the bootstrap and the second based on subsampling. The symbols
$\xrightarrow{L}$ and $\xrightarrow{P}$ will denote convergence in law (or distribution) and
convergence in probability, respectively.

3.1. A bootstrap construction. We now apply Theorem 2.1 to develop an
asymptotically valid approach based on the bootstrap, but specializing to the
case where $H_i$ is concerned with a test of a parameter. Suppose hypothesis
$H_i$ is specified by $\{P : \theta_i(P) \le 0\}$ for some real-valued parameter $\theta_i$.
Implicitly, the alternatives are one-sided, but the two-sided case can be similarly
handled. Suppose $\hat\theta_{n,i}$ is an estimate of $\theta_i$. Also, let $T_{n,i} = \tau_n \hat\theta_{n,i}$ for some
nonnegative (nonrandom) sequence $\tau_n \to \infty$. The sequence $\tau_n$ is introduced
for asymptotic purposes so that a limiting distribution for $\tau_n[\hat\theta_{n,i} - \theta_i(P)]$
exists. In typical situations, $\tau_n = n^{1/2}$.

The bootstrap method relies on its ability to approximate the joint
distribution of $\{\tau_n[\hat\theta_{n,i} - \theta_i(P)] : i \in K\}$, which we denote by $J_{n,K}(P)$. For
$K \subseteq \{1, \ldots, s\}$ with $|K| \ge k$, let $L_{n,K}(k, P)$ denote the distribution under
P of $k\text{-max}(\tau_n[\hat\theta_{n,i} - \theta_i(P)] : i \in K)$, with corresponding c.d.f. $L_{n,K}(x, k, P)$
and $\alpha$-quantile $b_{n,K}(\alpha, k, P) = \inf\{x : L_{n,K}(x, k, P) \ge \alpha\}$.

We will assume the normalized estimates satisfy the following.

Assumption B1.

(i) $J_{n,\{1,\ldots,s\}}(P) \xrightarrow{L} J_{\{1,\ldots,s\}}(P)$, a nondegenerate limit law.

(ii) $L_{I(P)}(\cdot, k, P)$ is continuous and strictly increasing on its support.

Part (i) implies that, for every $K \subseteq I(P)$, $L_{n,K}(k, P)$ has a limiting
distribution $L_K(k, P)$. Indeed, the k-max function is a continuous function and
the continuous mapping theorem applies; see Lemma A.1. Part (ii) makes
an additional mild assumption on the limit law $L_{I(P)}(k, P)$. In particular,
under Assumption B1, it follows that

(18) $b_{n,I(P)}(1 - \alpha, k, P) \to b_{I(P)}(1 - \alpha, k, P)$,

where $b_{I(P)}(\alpha, k, P)$ is the $\alpha$-quantile of the limiting distribution $L_{I(P)}(k, P)$.

Let $\hat Q_n$ be some unrestricted estimate of P, that is, $\hat Q_n$ does not obey
the null hypothesis constraints. For i.i.d. data, in the absence of a
parametric model for P, $\hat Q_n$ is typically taken to be the empirical distribution
of the observed data, or possibly a smoothed version (i.e., nonparametric
bootstrap); on the other hand, if a parametric model for P is assumed, then
$\hat Q_n$ should be based on this model (i.e., parametric bootstrap); see [7]. For
time series or data-dependent situations, bootstrap methods that can
capture the underlying dependence structure should be employed, such as block
bootstraps, sieve bootstraps or Markov bootstraps; see [16]. Then a nominal
$1 - \alpha$ level bootstrap joint confidence region for the subset of parameters
$\{\theta_i(P) : i \in K\}$ is given by

(19) $\{(\theta_i : i \in K) : \max(\tau_n[\hat\theta_{n,i} - \theta_i] : i \in K) \le b_{n,K}(1 - \alpha, 1, \hat Q_n)\}$
     $= \{(\theta_i : i \in K) : \theta_i \ge \hat\theta_{n,i} - \tau_n^{-1} b_{n,K}(1 - \alpha, 1, \hat Q_n)\}$.

So a value of 0 for $\theta_i(P)$ falls outside the region if and only if
$\tau_n \hat\theta_{n,i} > b_{n,K}(1-\alpha,1,Q_n)$. By the usual duality of confidence sets
and hypothesis tests, this suggests the use of the critical value

(20) $\hat c_{n,K}(1-\alpha,1) = b_{n,K}(1-\alpha,1,Q_n),$

to control the familywise error rate (i.e., the $k$-FWER with $k = 1$). Since
here we require control of the $k$-FWER, we merely replace the max in (19)
with the $k$-max and $b_{n,K}(1-\alpha,1,Q_n)$ with $b_{n,K}(1-\alpha,k,Q_n)$. Such a
generalized joint confidence region should asymptotically contain all true
parameter values except for possibly at most $k-1$ of them, with probability
(asymptotically) at least $1-\alpha$. Thus, the bootstrap critical value we use
will be

(21) $\hat c_{n,K}(1-\alpha,k) = b_{n,K}(1-\alpha,k,Q_n).$
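
The critical value in (21) can be approximated by Monte Carlo. The following is a minimal sketch only, under assumptions that are not part of the text above: each $\theta_i(P)$ is a population mean, $\tau_n = \sqrt{n}$, $Q_n$ is the empirical distribution (nonparametric bootstrap), and the statistics are not Studentized. The function names are hypothetical.

```python
import math
import random
from statistics import mean


def k_max(values, k):
    """Return the k-th largest entry of values (k = 1 gives the maximum)."""
    return sorted(values, reverse=True)[k - 1]


def quantile(sample, level):
    """Empirical quantile inf{x : F(x) >= level} of a list of numbers."""
    ordered = sorted(sample)
    # Smallest 1-based rank j with j / len(sample) >= level (small float tolerance).
    j = max(1, math.ceil(level * len(ordered) - 1e-9))
    return ordered[j - 1]


def bootstrap_critical_value(data, K, k, alpha, B=500, seed=0):
    """Monte Carlo approximation of b_{n,K}(1 - alpha, k, Q_n).

    data : list of n observations, each a list of s coordinates
    K    : list of coordinate indices with len(K) >= k
    Assumes theta_i(P) is the i-th mean, tau_n = sqrt(n), and Q_n is the
    empirical distribution of the data.
    """
    rng = random.Random(seed)
    n = len(data)
    theta_hat = [mean(row[i] for row in data) for i in K]
    draws = []
    for _ in range(B):
        # Resample n observations with replacement from the data.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        theta_star = [mean(row[i] for row in resample) for i in K]
        centered = [math.sqrt(n) * (ts - th) for ts, th in zip(theta_star, theta_hat)]
        draws.append(k_max(centered, k))
    return quantile(draws, 1 - alpha)
```

Because the same seed produces the same resamples for every index set, critical values computed this way for nested index sets are ordered pathwise, which mirrors the monotonicity property discussed next.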

Note that, regardless of asymptotic behavior, the monotonicity
assumption (14) is always satisfied for the choice (21). Indeed, for any $Q$ and if
$I \subset K$, $b_{n,I}(1-\alpha,k,Q)$ is the $1-\alpha$ quantile under $Q$ of the $k$-max of $|I|$
variables, while $b_{n,K}(1-\alpha,k,Q)$ is the $1-\alpha$ quantile of the $k$-max of these
same $|I|$ variables together with $|K| - |I|$ additional variables. This simple
observation together with Theorem 2.1 immediately yields:

Corollary 3.1. Under the setup and notation of this subsection,
consider Algorithm 2.1 with critical values given by (21). Then

(22) $k\text{-FWER}_P \le P\{k\text{-max}(T_{n,i} : i \in I(P)) > b_{n,I(P)}(1-\alpha,k,Q_n)\}.$

Therefore, in order to conclude $\limsup_n k\text{-FWER}_P \le \alpha$, it is now only
necessary to study the asymptotic behavior of $b_{n,I(P)}(1-\alpha,k,Q_n)$. For this,
we further assume the usual conditions for bootstrap consistency when
testing the single hypothesis that $\theta_i(P) \le 0$ for all $i \in I(P)$; that is, we assume
the bootstrap consistently estimates the joint distribution of $\tau_n[\hat\theta_{n,i} - \theta_i(P)]$
for $i \in I(P)$. Specifically, consider the following (more general) assumption.

Assumption B2. For any metric $\rho$ metrizing weak convergence on $\mathbb{R}^{|\{1,\ldots,s\}|}$,

$\rho(J_{n,\{1,\ldots,s\}}(P), J_{n,\{1,\ldots,s\}}(Q_n)) \xrightarrow{P} 0.$

Assumptions B1 and B2 are quite standard in the bootstrap literature,
and readily hold for general classes of statistics, such as estimators
which are smooth functions of means, U-statistics, L-statistics, estimators
which are differentiable functions of the empirical process, and so forth; see
[11, 31] and Chapter 1 of [21]. Thus, our results apply to a wide range of
problems. Under these assumptions, the following theorem proves
asymptotic control of the $k$-FWER of our bootstrap method.

Theorem 3.1. Fix $P$ satisfying Assumption B1. Let $Q_n$ be an estimate
of $P$ satisfying Assumption B2. Consider the method of Algorithm 2.1 with
$\hat c_{n,K}(1-\alpha,k)$ given by $b_{n,K}(1-\alpha,k,Q_n)$.

(i) Then $\limsup_n k\text{-FWER}_P \le \alpha$.

(ii) If $P$ is such that $i \notin I(P)$, that is, $H_i$ is false and $\theta_i(P) > 0$, then
the probability that the step-down method rejects $H_i$ tends to 1.

Remark 3.1. Typically, one would like to choose test statistics that 
lead to procedures that are balanced in the sense that all tests have about 
the same power and contribute equally to error control, as argued by [5, 23] 
and [32]. Achieving balance is best handled by appropriate choice of test 
statistics. For example, using p-values as the basic statistics will lead to 
better balance. Quite generally, Beran's prepivoting transformation can lead 
to balance; see [5] and [6]. Alternatively, balance can sometimes be achieved 
by Studentization. 

We now briefly consider the two-sided case. Suppose $H_i$ specifies $\theta_i(P) = 0$
against the alternative $\theta_i(P) \ne 0$. Let $L'_{n,K}(k,P)$ denote the distribution
under $P$ of $k\text{-max}(\tau_n|\hat\theta_{n,i} - \theta_i(P)| : i \in K)$ with corresponding distribution
function $L'_{n,K}(x,k,P)$ and $\alpha$-quantile $b'_{n,K}(\alpha,k,P) = \inf\{x : L'_{n,K}(x,k,P) \ge \alpha\}$.
Accordingly, $L'_K(k,P)$ denotes the limiting distribution of $L'_{n,K}(k,P)$.

Finally, let $T'_{n,i} = \tau_n|\hat\theta_{n,i}|$. The following theorem extends Theorem 3.1 to
the two-sided case.

Theorem 3.2. Fix $P$ satisfying Assumption B1, but with $L_{I(P)}(k,P)$
in Assumption B1(ii) replaced by $L'_{I(P)}(k,P)$. Let $Q_n$ be an estimate of $P$
satisfying Assumption B2. Apply Algorithm 2.1 using the test statistics $T'_{n,i}$
and with $\hat c_{n,K}(1-\alpha,k)$ given by $b'_{n,K}(1-\alpha,k,Q_n)$.

(i) Then $\limsup_n k\text{-FWER}_P \le \alpha$.

(ii) If $P$ is such that $i \notin I(P)$, that is, $H_i$ is false and $\theta_i(P) \ne 0$, then
the probability that the step-down method rejects $H_i$ tends to 1.

(iii) Moreover, if the above algorithm rejects $H_i$ and it is declared that
$\theta_i > 0$ when $\hat\theta_{n,i} > 0$, the probability of making a Type 3 error [i.e., of
declaring $\theta_i(P)$ positive when it is negative or declaring it negative when it is
positive] tends to 0.

So far, the bootstrap construction has been based on Algorithm 2.1. The
following theorem shows that asymptotic control of the $k$-FWER is also
achieved by the computationally less expensive streamlined Algorithm 2.2.
For brevity we only focus on the one-sided case, that is, the setting of
Theorem 3.1; the two-sided case is similar.

Theorem 3.3. Fix $P$ satisfying Assumption B1. Let $Q_n$ be an estimate
of $P$ satisfying Assumption B2. Consider the step-down method in Algorithm
2.2 with $\hat c_{n,K}(1-\alpha,k)$ replaced by $b_{n,K}(1-\alpha,k,Q_n)$. Then the conclusions
of Theorem 3.1 continue to hold.

Remark 3.2. The proofs of both Theorems 3.1 and 3.3 rely on asymp- 
totic arguments. Nevertheless, some important differences should be pointed 
out. First, the method based on Algorithm 2.1 is more conservative than the 
one based on the streamlined Algorithm 2.2: the latter will reject all the hy- 
potheses rejected by the former and potentially some further ones. 

Second, if instead of the estimated critical values $b_{n,K}(1-\alpha,k,Q_n)$ the
exact critical values $b_{n,K}(1-\alpha,k,P)$ could be used in place of $\hat c_{n,K}(1-\alpha,k)$,
then Algorithm 2.1 would provide finite sample control of the $k$-FWER while
Algorithm 2.2 would not.

Third, the bootstrap construction based on Algorithm 2.1 provides
asymptotic control of the $k$-FWER in the case of contiguous alternatives while the
construction based on Algorithm 2.2 may not. (An introduction to contiguity
is given in Section 12.3 of [18].)

Remark 3.3 (Operative method). The previous remark provides some
motivation to base the bootstrap construction on the more conservative
generic Algorithm 2.1. On the other hand, its computational burden can
be very high. To compute the critical value $\hat d_{n,A_j}(1-\alpha,k)$ in the $j$th step,
one has to evaluate $N_j = \binom{|R_j|}{k-1}$ quantiles $\hat c_{n,K}(1-\alpha,k)$ in order to then
take the largest one of those. Depending on $R_j$ and $k$, this number $N_j$ may
be very large. Therefore, we now suggest an operative method that retains
some of the desirable properties of Algorithm 2.1 while remaining always
computationally feasible. The suggestion is as follows. Pick a user specified
number $N_{\max}$, say $N_{\max} = 50$, and let $M$ be the largest integer for which
$\binom{M}{k-1} \le N_{\max}$. In step $j$ of Algorithm 2.1, the critical value is then computed
as

$\hat d_{n,A_j}(1-\alpha,k) = \max_{I \subset \{r_{\max\{1,|R_j|-M+1\}},\ldots,r_{|R_j|}\},\ |I| = k-1} \{\hat c_{n,K}(1-\alpha,k) : K = A_j \cup I\}.$

That is, we maximize over subsets $I$ not necessarily of the entire index set
$R_j$ of previously rejected hypotheses, but only of the index set corresponding
to the $M$ least significant hypotheses rejected so far. (Of course, when $M \ge |R_j|$,
we maximize over all subsets $I$ of $R_j$ of size $k-1$.) The philosophy of
this operative method is to be as close as possible to the generic Algorithm
2.1, given the limitation to the computational burden expressed by $N_{\max}$.
Finally, note that the streamlined algorithm is a special case of the operative
method when $N_{\max} = 1$ is chosen, resulting in $M = k-1$.
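
The bookkeeping of the operative method can be sketched as follows. This is an illustration only, not the authors' implementation: `largest_M`, `operative_critical_value` and the callable `crit` (which stands in for the map from an index set $K$ to its estimated critical value) are hypothetical names, and the list `rejected` is assumed to be ordered from most to least significant.

```python
from itertools import combinations
from math import comb


def largest_M(k, n_max):
    """Largest integer M with C(M, k - 1) <= n_max, for k >= 2 and n_max >= 1."""
    if k < 2 or n_max < 1:
        raise ValueError("need k >= 2 and n_max >= 1")
    M = k - 1  # C(k - 1, k - 1) = 1 always satisfies the bound
    while comb(M + 1, k - 1) <= n_max:
        M += 1
    return M


def operative_critical_value(rejected, not_rejected, k, n_max, crit):
    """Maximize crit(A_j | I) over subsets I of the M least significant rejections.

    rejected     : indices r_1, r_2, ... ordered from most to least significant
    not_rejected : indices in A_j
    crit         : hypothetical callable, frozenset K -> critical value for K
    """
    A = frozenset(not_rejected)
    if k == 1:
        return crit(A)  # the only subset of size k - 1 = 0 is the empty set
    pool = rejected[-largest_M(k, n_max):]  # the whole list when M >= len(rejected)
    if len(pool) < k - 1:
        raise ValueError("need at least k - 1 previously rejected hypotheses")
    return max(crit(A | frozenset(I)) for I in combinations(pool, k - 1))
```

For instance, `largest_M(3, 50)` and `largest_M(10, 50)` both return 10, while `largest_M(k, 1)` returns `k - 1`, in line with the remark that the streamlined algorithm corresponds to $N_{\max} = 1$.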

3.2. A general subsampling construction. In this subsection, we present
an alternative construction of critical values in our step-down procedure by
using subsampling. Unlike the previous subsection, we do not assume $H_i$
is concerned with the test of a parameter $\theta_i$; the approach here is quite
general and will hold under weaker asymptotic conditions as well. For any
$K \subset \{1,\ldots,s\}$, let $G_{n,K}(P)$ be the joint distribution of the statistics $T_{n,i}$,
$i \in K$, under $P$, with corresponding joint c.d.f. $G_{n,K}(x,P)$, $x \in \mathbb{R}^{|K|}$. Also,
let $H_{n,K}(k,P)$ denote the distribution of $k\text{-max}(T_{n,i} : i \in K)$ under $P$. As in
Section 2.1, let $c_{n,K}(1-\alpha,k,P)$ denote a $1-\alpha$ quantile of $H_{n,K}(k,P)$.

We will make the following general assumption. 

Assumption S. Under $P$, the joint distribution of the test statistics
$T_{n,i}$, $i \in I(P)$, has a limiting distribution; that is,

(23) $G_{n,I(P)}(P) \xrightarrow{L} G_{I(P)}(P).$

This implies that, under $P$, $k\text{-max}(T_{n,i} : i \in I(P))$ has a limiting distribution,
say $H_{I(P)}(k,P)$, with limiting c.d.f. $H_{I(P)}(x,k,P)$. Let $c_{I(P)}(\alpha,k,P)$ denote
an $\alpha$-quantile of $H_{I(P)}(k,P)$; that is,

$c_{I(P)}(\alpha,k,P) = \inf\{x : H_{I(P)}(x,k,P) \ge \alpha\}.$

We will assume further that $H_{I(P)}(x,k,P)$ is continuous and strictly
increasing at $x = c_{I(P)}(1-\alpha,k,P)$.

Note that the above continuity condition is satisfied if the $|I(P)|$ univariate
marginal distributions of $G_{I(P)}(P)$ are continuous; see Lemma A.1. Also,
the strictly increasing assumption can be removed; see Remark 1.2.1 of [21].

We now detail the general subsampling construction. To this end,
assume that we have available an i.i.d. sample $X_1,\ldots,X_n$ from $P$, and
$T_{n,i} = T_{n,i}(X_1,\ldots,X_n)$ is the test statistic we wish to use for testing $H_i$. To
describe the test construction, fix a positive integer $b \le n$ and let $Y_1,\ldots,Y_{N_n}$
be equal to the $N_n := \binom{n}{b}$ subsets of size $b$ of $\{X_1,\ldots,X_n\}$, ordered in any
fashion. Let $T_{b,i}^{(a)}$ be equal to the statistic $T_{b,i}$ evaluated at the data set $Y_a$, for
$a = 1,\ldots,N_n$. Then, for any subset $K \subset \{1,\ldots,s\}$, the joint distribution of
$(T_{n,i} : i \in K)$ can be approximated by the empirical distribution of the $N_n$
values $(T_{b,i}^{(a)} : i \in K)$. In other words, for $x \in \mathbb{R}^s$, the true joint c.d.f. of the
test statistics evaluated at $x$,

$G_{n,\{1,\ldots,s\}}(x,P) = P\{T_{n,1} \le x_1,\ldots,T_{n,s} \le x_s\},$

is estimated by the subsampling distribution

(24) $\hat G_{n,\{1,\ldots,s\}}(x) = \frac{1}{N_n} \sum_{a} 1\{T_{b,1}^{(a)} \le x_1,\ldots,T_{b,s}^{(a)} \le x_s\}.$

Note that the marginal distribution of any subset $K \subset \{1,\ldots,s\}$, $G_{n,K}(P)$,
is then approximated by the marginal distribution induced by (24) on that
subset of variables. So, $\hat G_{n,K}$ refers to the empirical distribution of the
values $(T_{b,i}^{(a)} : i \in K)$. (In essence, one only has to estimate one joint sampling
distribution for all the test statistics because this then induces that of any
subset, even though we are not assuming anything like subset pivotality.)

Similarly, the estimate of the whole joint distribution of test statistics
induces an estimate for the distribution of the maximum or $k$th largest of test
statistics. Specifically, $H_{n,K}(k,P)$ is estimated by the empirical distribution
$\hat H_{n,K}(x,k)$ of the values $k\text{-max}(T_{b,i}^{(a)} : i \in K)$; that is,

$\hat H_{n,K}(x,k) = \frac{1}{N_n} \sum_{a} 1\{k\text{-max}(T_{b,i}^{(a)} : i \in K) \le x\}.$

Also, let

(25) $\hat c_{n,K}(1-\alpha,k) = \inf\{x : \hat H_{n,K}(x,k) \ge 1-\alpha\}$

denote the estimated $1-\alpha$ quantile of the $k$-max of test statistics $T_{n,i}$ with
$i \in K$.
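
The subsampling quantile in (25) can be sketched in a few lines. The code below is a hedged illustration rather than the authors' implementation: the callable `statistic` (mapping a data set of size $b$ to the $s$ test statistics) is a hypothetical user-supplied function, and all subsets of size $b$ are enumerated, which is only practical for small $n$; in practice one would typically evaluate a random selection of subsets instead.

```python
import math
from itertools import combinations


def k_max(values, k):
    """Return the k-th largest entry of values (k = 1 gives the maximum)."""
    return sorted(values, reverse=True)[k - 1]


def subsampling_critical_value(data, statistic, K, k, alpha, b):
    """Estimated 1 - alpha quantile of the k-max of the statistics indexed by K.

    data      : list of n observations
    statistic : hypothetical callable, list of b observations -> list of s statistics
    K         : list of indices with len(K) >= k
    """
    kmax_values = []
    for subset in combinations(data, b):
        t = statistic(list(subset))  # statistics evaluated on one subsample
        kmax_values.append(k_max([t[i] for i in K], k))
    ordered = sorted(kmax_values)
    # Smallest order statistic whose empirical c.d.f. value is >= 1 - alpha.
    j = max(1, math.ceil((1 - alpha) * len(ordered) - 1e-9))
    return ordered[j - 1]
```

Since one set of subsample statistics serves every index set $K$, the same enumeration can be reused for all critical values needed in the step-down algorithm.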






Note the monotonicity of the critical values: for $I \subset K$,

(26) $\hat c_{n,K}(1-\alpha,k) \ge \hat c_{n,I}(1-\alpha,k).$

This simple observation together with Theorem 2.1 immediately yields:

Corollary 3.2. Under the setup and notation of this subsection,
consider Algorithm 2.1 with critical values given by (25). Then

(27) $k\text{-FWER}_P \le P\{k\text{-max}(T_{n,i} : i \in I(P)) > \hat c_{n,I(P)}(1-\alpha,k)\}.$

The following result proves consistency and $k$-FWER control of our
step-down algorithm based on these subsample estimates of critical values. Note,
in particular, that Assumption B2 is not needed here at all, a reflection of the
fact that the bootstrap requires much stronger (local uniform convergence)
assumptions for consistency; see [21].

Theorem 3.4. Suppose Assumption S holds. Let $b/n \to 0$, $\tau_b/\tau_n \to 0$
and $b \to \infty$.

(i) The subsampling approximation satisfies $\rho(\hat G_{n,I(P)}, G_{n,I(P)}(P)) \xrightarrow{P} 0$
for any metric $\rho$ metrizing weak convergence on $\mathbb{R}^{|I(P)|}$.

(ii) The subsampling critical values satisfy
$\hat c_{n,I(P)}(1-\alpha,k) \xrightarrow{P} c_{I(P)}(1-\alpha,k,P)$.

(iii) Therefore, using Algorithm 2.1 with $\hat c_{n,K}(1-\alpha,k)$ given by (25)
results in $\limsup_n k\text{-FWER}_P \le \alpha$.

The above approach can be extended to dependent data; see [21].

4. Asymptotic results on FDP control. In some applications, one might 
be willing to tolerate a larger number of false rejections in case the total 
number of rejections is large. In other words, one might be willing to tolerate 
a certain (small) fraction of false rejections out of the total rejections. This 
leads to control based on the false discovery proportion (FDP). Let $F$ be
the number of false rejections made by a multiple testing procedure and let
$R$ be the total number of rejections. Then the FDP is defined as

$\mathrm{FDP} = \begin{cases} F/R, & \text{if } R > 0, \\ 0, & \text{if } R = 0. \end{cases}$

A multiple testing procedure is said to control the FDP at level $\alpha$ if, for the
given sample size $n$, $P\{\mathrm{FDP} > \gamma\} \le \alpha$, for all $P$. A multiple testing
procedure is said to asymptotically control the FDP at level $\alpha$ if
$\limsup_n P\{\mathrm{FDP} > \gamma\} \le \alpha$, for all $P$. Our focus will be on procedures that provide
asymptotic control. Notice that a procedure satisfying $P\{\mathrm{FDP} > \gamma\} \le 0.5$ guarantees
that the median of the FDP is $\le \gamma$. The main goal of this section is to
construct a method which provides asymptotic control of the FDP.
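
As a small illustration of these definitions, the sketch below computes the realized FDP of one rejection set and the empirical frequency of the event $\{\mathrm{FDP} > \gamma\}$ over repeated experiments. The function names are hypothetical, and knowledge of the true null hypotheses is of course only available in a simulation.

```python
def fdp(rejected, true_nulls):
    """False discovery proportion F / R, defined to be 0 when nothing is rejected."""
    rejected = set(rejected)
    R = len(rejected)                    # total number of rejections
    F = len(rejected & set(true_nulls))  # number of false rejections
    return F / R if R > 0 else 0.0


def exceedance_frequency(fdp_values, gamma):
    """Fraction of realized FDP values that are strictly greater than gamma."""
    return sum(1 for v in fdp_values if v > gamma) / len(fdp_values)
```

For example, rejecting four hypotheses of which one is a true null gives an FDP of 0.25, which exceeds $\gamma = 0.1$.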

The approach we propose is built upon an underlying procedure that
(asymptotically) controls the $k$-FWER for any fixed $k \ge 1$. We then
sequentially apply this $k$-FWER procedure for $k = 1,2,\ldots$ until a stopping
rule indicates termination. In the end, we reject all hypotheses that were
rejected in the last round of applying the $k$-FWER procedure.

Algorithm 4.1 (Generic method for control of the FDP). 

1. Let $j = 1$ and let $k_1 = 1$.

2. Apply the $k_j$-FWER procedure and denote by $N_j$ the number of
hypotheses it rejects.

3. (a) If $N_j < k_j/\gamma - 1$, stop and reject all hypotheses rejected by the
$k_j$-FWER procedure.

(b) Otherwise, let $j = j + 1$ and then $k_j = k_{j-1} + 1$. Return to Step 2.
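
The loop of Algorithm 4.1 can be sketched as follows. The callable `kfwer_procedure` is a hypothetical stand-in for any underlying $k$-FWER procedure that returns the set of rejected hypothesis indices for a given $k$; nothing about its internals is assumed. The loop terminates because the number of rejections is bounded by $s$ while $k_j/\gamma - 1$ grows without bound.

```python
def fdp_step_procedure(kfwer_procedure, gamma):
    """Apply k-FWER procedures for k = 1, 2, ... until the stopping rule fires.

    kfwer_procedure : hypothetical callable, k -> set of rejected indices
    gamma           : tolerated false discovery proportion, 0 < gamma < 1
    Returns the final rejection set and the value of k at which the loop stopped.
    """
    k = 1
    while True:
        rejected = kfwer_procedure(k)
        # Step 3(a): stop as soon as N_j < k_j / gamma - 1.
        if len(rejected) < k / gamma - 1:
            return rejected, k
        k += 1  # Step 3(b): move on to the next k
```

With $\gamma = 0.1$, the first round stops unless at least 9 hypotheses are rejected by the 1-FWER procedure, the second unless at least 19 are rejected by the 2-FWER procedure, and so on.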

Note that the algorithm does not assume anything about the nature of
the underlying $k$-FWER procedure. However, in order to reject as many
false hypotheses as possible while maintaining (asymptotic) control of the
FDP, we suggest employing a stepwise procedure that accounts for the
dependence structure of the test statistics $T_{n,i}$. Algorithm 4.1 is similar to the
proposal of [15] for FDP control which is, however, restricted to a
multivariate permutation model. The proposal of [15] is heuristic in the sense that
they cannot guarantee finite sample or asymptotic control of the FDP even
if the permutation hypothesis is valid. However, we will show asymptotic
control (and simulations presented later show good finite sample control).
The theorem below considers a general bootstrap construction where the
individual tests are one-sided and concern univariate parameters $\theta_i(P)$. The
bootstrap construction for two-sided tests and the more general subsampling
construction can be handled similarly.

Theorem 4.1. Consider the setup of Theorem 3.1. Fix $P$ satisfying
Assumption B1. Let $Q_n$ be an estimate of $P$ satisfying Assumption B2.
Employ the step-down procedure of Algorithm 2.1 with $\hat c_{n,K}(1-\alpha,k)$ replaced by
$b_{n,K}(1-\alpha,k,Q_n)$ as the underlying $k$-FWER procedure. Then the following
statements concerning Algorithm 4.1 are true:

(i) $\limsup_n P\{\mathrm{FDP} > \gamma\} \le \alpha$.

(ii) If $P$ is such that $i \notin I(P)$, that is, $H_i$ is false and $\theta_i(P) > 0$, then
the probability that the method rejects $H_i$ tends to 1.

Remark 4.1. The theorem remains valid if the bootstrap $k$-FWER
procedure is based on the operative method of Remark 3.3 or the streamlined
Algorithm 2.2 instead of the generic Algorithm 2.1. But, again, in view of
finite sample performance, we suggest the use of the generic Algorithm 2.1
if feasible or at least the use of the operative method.

5. Comparison with related methods. We have proposed step-down
procedures that control the $k$-FWER and the FDP, with the goal of improving
upon methods that do not attempt to incorporate or estimate the
dependence structure between the test statistics or p-values. An alternative
approach toward achieving this goal is given in [34]. We briefly discuss their
proposal. (Note that resampling-based procedures of [9] and [34], among others,
are implemented in the open source R package multtest released as part of
the Bioconductor Project; see cran.r-project.org and www.bioconductor.org.)

The approach of [34] begins with an initial procedure that controls the
1-FWER (i.e., the usual FWER) and then rejects in addition the $k-1$ most
significant hypotheses not rejected so far. They coin this an augmentation
procedure, since the 1-FWER rejection set is augmented by the $k-1$ next
most significant hypotheses to arrive at the $k$-FWER rejection set.
Obviously, if the 1-FWER procedure succeeds in (asymptotically) controlling the
1-FWER, then the augmented procedure provides (asymptotic) control of
the $k$-FWER. However, this approach seems suboptimal, because it makes
the worst case assumption that, having achieved 1-FWER control, the $k-1$
next most significant hypotheses are all true hypotheses. Moreover, $k-1$
additional hypotheses are always rejected, even if the test statistics or
p-values to which they correspond are clearly not significant. In addition, the
approach really does not fully utilize the weaker measure of error control
afforded by using the $k$-FWER with $k > 1$, in that the augmentation method
will reject more than $k-1$ hypotheses if and only if the 1-FWER controlling
procedure rejects some hypotheses, and this criterion may be too strong to
admit any rejections.

Our approach to control the $k$-FWER is based on knowing or
estimating the sampling distribution of a suitable $k$-max statistic, that is, the $k$th
largest of the $s$ individual (possibly standardized) test statistics. A
hypothesis is rejected if its corresponding test statistic is large (relative to the
estimated quantiles of the sampling distribution of the $k$-max statistic),
unlike the augmentation approach where a hypothesis can be rejected even if
its corresponding test statistic is not deemed large by any measure.

To appreciate how the two approaches differ, first consider augmentation
based on the Holm procedure, given by (6) with $k = 1$. Other than the
additional $k-1$ hypotheses that are rejected after applying Holm, the procedure
rejects a nontrivial number ($k$ or more) of hypotheses if and only if the smallest
p-value is $\le \alpha/s$. On the other hand, the generalized Holm procedure starts
out with a great advantage; the smallest p-value is compared with $k\alpha/s$,
a $k$-fold increase. While it is possible for augmentation to reject more
hypotheses, it can only reject $k-1$ more than the Holm procedure (and these
additional rejections may be suspect because they can correspond to large
p-values), but the generalized Holm procedure can reject many, many more.
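
As a worked illustration of the size of this advantage, take $s = 400$, $k = 10$ and $\alpha = 0.05$, the values used for the larger design in the simulation study of Section 6. The two thresholds for the smallest p-value are then:

```latex
\[
\frac{\alpha}{s} = \frac{0.05}{400} = 1.25 \times 10^{-4},
\qquad
\frac{k\alpha}{s} = \frac{10 \times 0.05}{400} = 1.25 \times 10^{-3}.
\]
```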

Similar comparisons can be made with augmentation applied to a FWER 
controlling procedure that attempts to account for the dependence structure 
(like the ones in this paper with k = 1). Augmentation might reject k — 
1 more hypotheses than the ones we propose here, but our methods can 
easily reject many more. Note that, if the test statistics or p-values are 
independent, then augmentation of a bootstrap method that controls the 
FWER still cannot produce anything much better than the Holm method. 

The comparison is similar for the procedures controlling the FDP. Our
approach is to sequentially apply a $k$-FWER procedure for $k = 1,2,\ldots$ until
a stopping rule indicates termination. On the other hand, [34] again augment
the rejection set of an initial 1-FWER procedure. The idea now is as follows.
Let $R$ denote the number of rejections by the 1-FWER procedure. Then
reject in addition the $D$ next most significant hypotheses where $D$ is the
largest integer which satisfies

$\frac{D}{D + R} \le \gamma.$

Again, if the 1-FWER procedure succeeds in (asymptotically) controlling 
the 1-FWER, then the augmented procedure provides (asymptotic) control 
of the FDP. But also again, this approach seems pessimistic in that it makes 
the worst case assumption that, having achieved 1-FWER control, the D 
next most significant hypotheses are all true hypotheses. 
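
The augmentation rules discussed in this section reduce to simple counting. The sketch below assumes, as in the description above, that $D$ is the largest integer with $D/(D+R) \le \gamma$ (capped by the number of hypotheses not yet rejected), and that the initial 1-FWER procedure rejects the $R$ most significant hypotheses; the function names are hypothetical.

```python
def augmentation_count(R, gamma, s):
    """Largest integer D with D / (D + R) <= gamma, capped at the s - R hypotheses left."""
    D = 0
    while D + 1 <= s - R and (D + 1) / (D + 1 + R) <= gamma:
        D += 1
    return D


def augment(ordered, R, extra):
    """Return the augmented rejection set.

    ordered : hypothesis indices sorted from most to least significant
    R       : number of hypotheses rejected by the initial 1-FWER procedure
    extra   : number of additional rejections (k - 1 for k-FWER, D for the FDP)
    """
    return ordered[:R + extra]
```

With $\gamma = 0.1$, no hypothesis is added unless the initial procedure rejects at least 9 hypotheses, since $1/(1+R) \le 0.1$ requires $R \ge 9$.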

The next section compares the finite sample performance of the two ap- 
proaches. 

6. Simulation study. This section presents a small simulation study in
the context of testing population means. We generate random vectors
$X_1,\ldots,X_n$ from an $s$-dimensional multivariate normal distribution with
mean vector $\theta = (\theta_1,\ldots,\theta_s)$, where $n = 100$ and $s = 50$ or $s = 400$. The
null hypotheses are $H_i : \theta_i \le 0$ and the alternative hypotheses are $H_i' : \theta_i > 0$.
The test statistics are $T_{n,i} = \sqrt{n}\,\bar X_{i,\cdot}/S_i$, where

$\bar X_{i,\cdot} = \frac{1}{n} \sum_{j=1}^{n} X_{i,j}$

and

$S_i^2 = \frac{1}{n-1} \sum_{j=1}^{n} (X_{i,j} - \bar X_{i,\cdot})^2.$

The individual means $\theta_i$ are equal to either 0 or 0.25. The number of
means equal to 0.25 is 0, 10, 25 or 50 when $s = 50$ and 0, 100, 200 or 400
when $s = 400$. The covariance matrix is of the common correlation structure
$\sigma_{i,i} = 1$ and $\sigma_{i,j} = \rho$ for $i \ne j$. We consider the three values $\rho = 0.0$, 0.5 and
0.8. Other specifications of the covariance matrix do not lead to results that
are qualitatively different; see [27].
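
One simulated data set of this design, together with its Studentized statistics, can be generated along the following lines. This is a sketch using only the Python standard library, not the authors' code: the common correlation structure is obtained from a shared normal factor, which is one standard way to produce unit variances and pairwise correlation $\rho$, and the function names are hypothetical.

```python
import math
import random
from statistics import mean, stdev


def simulate(n, theta, rho, rng):
    """Draw n normal vectors with means theta, unit variances and common correlation rho."""
    data = []
    for _ in range(n):
        common = rng.gauss(0, 1)  # factor shared by all coordinates
        data.append([
            t + math.sqrt(rho) * common + math.sqrt(1 - rho) * rng.gauss(0, 1)
            for t in theta
        ])
    return data


def t_statistics(data):
    """Studentized statistics sqrt(n) * mean / standard deviation, one per coordinate."""
    n = len(data)
    stats = []
    for i in range(len(data[0])):
        column = [row[i] for row in data]
        # statistics.stdev uses the n - 1 denominator.
        stats.append(math.sqrt(n) * mean(column) / stdev(column))
    return stats
```

For instance, `simulate(100, [0.0] * 40 + [0.25] * 10, 0.5, random.Random(0))` produces one data set for the scenario with $s = 50$, ten false hypotheses and $\rho = 0.5$.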

We include the following multiple testing procedures in the study. The
value of $k$ is $k = 3$ when $s = 50$ and $k = 10$ when $s = 400$. The nominal level
is $\alpha = 0.05$, unless indicated otherwise.

• (1-Boot) The bootstrap 1-FWER construction of Section 3.1. (This
construction is equivalent to the FWER maxT procedure of [9].)

• ($k$-Aug) The $k$-FWER augmentation procedure of [34].

• ($k$-gH) The $k$-FWER generalized Holm procedure described by (6).

• ($k$-Boot) The bootstrap $k$-FWER construction of Section 3.1.

• (Aug$_{0.1}$) The FDP augmentation procedure of [34] with $\gamma = 0.1$.

• (EB$_{0.1}$) The empirical Bayes FDP procedure of [33] with $\gamma = 0.1$.

• (LR$_{0.1}$) The FDP procedure of [17] with $\gamma = 0.1$; see (7).

• (Boot$_{0.1}$) The bootstrap FDP construction of Section 4 with $\gamma = 0.1$.

• (Boot$^{\mathrm{Med}}_{0.1}$) The bootstrap FDP construction of Section 4 with $\gamma = 0.1$ but
nominal level $\alpha = 0.5$. Therefore, this procedure asymptotically controls
the median FDP to be bounded above by $\gamma = 0.1$.

The augmentation procedures $k$-Aug and Aug$_{0.1}$ are both based on the
step-down 1-Boot construction as the initial 1-FWER controlling procedure. The
$k$-Boot procedure is based on the operative method with $N_{\max} = 50$; see
Remark 3.3. The estimate $Q_n$ employed in the bootstrap is the empirical
distribution of the observed data; and for each simulated data set, the same
set of $B = 500$ resamples is shared by all bootstrap procedures. The
individual p-values for $k$-gH and LR$_{0.1}$ are derived from the relation $T_{n,i} \sim t_{n-1}$
under $H_i$.

The performance criteria are (i) the empirical $k$-FWERs and FDPs,
compared to the nominal level $\alpha = 0.05$ (or $\alpha = 0.5$ for the method controlling
the median FDP); and (ii) the average number of false hypotheses rejected.
Since the $k$-Aug procedure rejects the $k-1$ most significant hypotheses
regardless of the data, we also follow this route for the $k$-gH and $k$-Boot
procedures to ensure a fair comparison as far as (ii) is concerned (though
the differences are really negligible if this route is not followed for the $k$-gH
and $k$-Boot procedures). The results are presented in Table 1 for $s = 50$ and
in Table 2 for $s = 400$. They can be summarized as follows.

• Almost all methods provide satisfactory finite sample control of their
respective $k$-FWER or FDP criteria. In particular, the finite sample control
does not appear to deteriorate when the number of hypotheses is increased
from $s = 50$ to $s = 400$, while the sample size is kept fixed at $n = 100$.
Table 1

Empirical FWEs and FDPs expressed as percentages (in the rows "Control") and
average number of false hypotheses rejected (in the rows "Rejected") for various
methods, with $n = 100$ and $s = 50$. The nominal level is $\alpha = 5\%$, apart from the last
column where it is $\alpha = 50\%$. The number of repetitions is 5,000 per scenario and the
number of bootstrap resamples is $B = 500$

             1-Boot    3-Aug     3-gH   3-Boot  Aug_0.1   EB_0.1   LR_0.1 Boot_0.1  Boot^Med_0.1

Common correlation: $\rho = 0$
All $\theta_i = 0$
Control         5.0      5.0      0.0      4.6      5.0     29.5      4.9      5.0          52.3
Rejected        0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0           0.0
Ten $\theta_i = 0.25$
Control         4.8      0.1      0.0      3.1      4.8     24.1      4.4      4.8          47.6
Rejected        2.6      4.5      3.9      6.3      2.6      5.0      2.6      2.6           6.3
Twenty-five $\theta_i = 0.25$
Control         3.3      0.0      0.0      2.2      1.9      4.5      2.1      3.0          39.1
Rejected        6.9      8.9      9.5     16.7      7.2     15.5      7.2      7.8          21.3
All $\theta_i = 0.25$
Control         0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0           0.0
Rejected       14.9     16.9     19.2     42.3     16.2     46.8     21.3     45.3          50.0

Common correlation: $\rho = 0.5$
All $\theta_i = 0$
Control         5.3      5.3      1.6      5.3      5.3     13.1      3.0      5.3          50.7
Rejected        0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0           0.0
Ten $\theta_i = 0.25$
Control         5.0      2.9      1.4      4.5      4.1      9.2      2.6      4.7          49.0
Rejected        3.4      5.2      4.3      5.6      3.4      5.0      2.7      3.4           8.3
Twenty-five $\theta_i = 0.25$
Control         4.3      2.0      0.1      4.4      2.8      8.4      1.6      4.5          47.2
Rejected        8.7     10.6      9.6     14.2      9.2     13.9      7.8     10.4          22.8
All $\theta_i = 0.25$
Control         0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0           0.0
Rejected       20.2     22.0     19.2     33.0     21.5     39.3     22.5     30.6          48.9

Common correlation: $\rho = 0.8$
All $\theta_i = 0$
Control         4.9      4.9      1.3      5.2      4.9      6.6      1.4      4.9          50.0
Rejected        0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0           0.0
Ten $\theta_i = 0.25$
Control         5.1      5.1      1.4      4.9      4.1      6.6      1.6      4.9          49.2
Rejected        4.8      6.3      4.6      6.4      4.9      5.1      2.7      4.9           9.3
Twenty-five $\theta_i = 0.25$
Control         4.6      4.6      0.1      4.5      4.5      7.4      1.6      4.5          48.0
Rejected       12.2     13.8      9.9     15.5     12.8     14.7      7.8     13.5          23.9
All $\theta_i = 0.25$
Control         0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0           0.0
Rejected       27.1     28.4     19.5     33.7     27.9     38.0     21.4     33.1          49.0

Table 2

Empirical FWEs and FDPs expressed as percentages (in the rows "Control") and
average number of false hypotheses rejected (in the rows "Rejected") for various
methods, with $n = 100$ and $s = 400$. The nominal level is $\alpha = 5\%$, apart from the last
column where it is $\alpha = 50\%$. The number of repetitions is 5,000 when all $\theta_i = 0$ and
2,000 for all other scenarios; and the number of bootstrap resamples is $B = 500$

             1-Boot   10-Aug    10-gH  10-Boot  Aug_0.1   EB_0.1   LR_0.1 Boot_0.1  Boot^Med_0.1

Common correlation: $\rho = 0$
All $\theta_i = 0$
Control         5.0      5.0      0.0      1.6      4.9      4.4      4.8      4.9          54.4
Rejected        0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0           0.0
One hundred $\theta_i = 0.25$
Control         4.3      0.0      0.0      0.5      1.0      1.7      1.5      1.7          41.0
Rejected       10.9     19.8     28.2     59.4     11.7     44.9     14.2     29.7          68.7
Two hundred $\theta_i = 0.25$
Control         2.7      0.0      0.0      0.4      0.0      0.1      0.0      0.4          29.9
Rejected       22.0     31.1     56.1    126.0     24.2    155.0     43.8    146.1         173.0
All $\theta_i = 0.25$
Control         0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0           0.0
Rejected       45.4     54.4    112.4    341.1     50.5    390.0    153.7    400.0         400.0

Common correlation: $\rho = 0.5$
All $\theta_i = 0$
Control         5.5      5.5      0.1      5.5      5.5      5.5      2.2      5.5          51.4
Rejected        0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0           0.0
One hundred $\theta_i = 0.25$
Control         4.9      0.4      0.5      4.4      0.5      8.0      0.7      4.2          50.5
Rejected       18.3     27.2     29.8     48.2     20.0     37.7     17.9     34.0          86.3
Two hundred $\theta_i = 0.25$
Control         5.0      0.4      0.5      5.1      0.3      7.8      1.1      5.0          50.2
Rejected       38.3     47.2     57.1     99.3     42.4    106.3     51.1     92.7         183.7
All $\theta_i = 0.25$
Control         0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0           0.0
Rejected       84.3     93.3    114.0    237.7     93.0    314.9    169.8    282.7         395.5

Common correlation: $\rho = 0.8$
All $\theta_i = 0$
Control         5.3      5.3      0.1      5.2      5.3      5.3      0.7      5.3          51.3
Rejected        0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0           0.0
One hundred $\theta_i = 0.25$
Control         4.9      4.1      0.5      4.5      4.9      6.2      0.7      4.5          50.8
Rejected       36.2     44.3     31.6     57.7     39.0     47.8     19.2     49.2          95.0
Two hundred $\theta_i = 0.25$
Control         5.4      4.2      0.1      5.4      5.4      6.6      1.2      5.4          50.5
Rejected       74.3     82.5     59.3    116.3     80.3    115.4     52.4    112.6         192.9
All $\theta_i = 0.25$
Control         0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0           0.0
Rejected      165.7    172.8    117.0    255.1    174.4    301.9    149.5    275.3         392.8

• The exception is EB$_{0.1}$, which can be quite liberal, in particular when
$s = 50$ and all null hypotheses are true. As acknowledged to us by the
authors of [33], this method is not consistent when all null hypotheses are
true and they advocate its use only in settings when false null hypotheses
can be anticipated. We provide a brief explanation in Appendix A.

• The relative power of the conservative methods $k$-gH and LR$_{0.1}$ compared
to the procedures based on the bootstrap, $k$-Aug and $k$-Boot, decreases as
the common correlation $\rho$ increases.

• Depending on context, $k$-Boot can reject many more false hypotheses
than 1-Boot. The same is not true for $k$-Aug, since, by design,
it detects at most $k-1$ more false hypotheses compared to 1-Boot. So
especially when $s$ is large, this approach appears suboptimal. Even the
conservative $k$-gH method can be more powerful than the augmentation
method for large $s$.

• The comparison is similar for the various FDP procedures. Of all the
procedures that provide satisfactory finite sample control, Boot$_{0.1}$ is the
most powerful one. Aug$_{0.1}$ becomes uncompetitive when $s$ is large and
can even be outperformed by the conservative LR$_{0.1}$ method. Note that
EB$_{0.1}$ is often more powerful than Boot$_{0.1}$, but given that its overall finite
sample control is not satisfactory, one should be cautious in using this
method.

• The power advantage of $k$-Boot and Boot$_{0.1}$ over 1-Boot diminishes as
the common correlation $\rho$ increases. (As a result, the same is true for the
power advantages of $k$-Boot over $k$-Aug and of Boot$_{0.1}$ over Aug$_{0.1}$, resp.)
This is not surprising. Take the extreme case of $\rho = 1$ in our simulation
set-up where all nonzero means are equal. In this case 1-Boot rejects
either no false hypotheses or all false hypotheses. On the other hand,
$k$-Boot rejects either at most $k-1$ false hypotheses (when the $k-1$ most
significant hypotheses are rejected regardless) or also all false hypotheses.
(Note that the 1-max of the "alternative" test statistics will be equal
to the $k$-max of the "alternative" test statistics and analogously for the
"null" test statistics.) This implies a minimal power gain of $k$-Boot over
1-Boot compared to the case of $\rho = 0$ where the additional number of
rejected false hypotheses can far exceed $k-1$.

The procedure controlling the median FDP (last column) is always the
most powerful one. However, it should be understood that it is
philosophically different from the other FDP controlling procedures. If
$P\{\mathrm{FDP} > 0.1\} \le 0.05$ is achieved, then, in a given application, one can be 95%
confident that the realized FDP is at most 0.1. On the other hand, if
$P\{\mathrm{FDP} > 0.1\} \le 0.5$ is achieved (i.e., control of the median FDP), then, in a given
application, one can only be 50% confident that the realized FDP is at most
0.1. So, loosely speaking, there is a good chance that the realized FDP ends
up greater than 0.1, and perhaps by quite a bit. Romano and Wolf [27]
examine this issue in more detail by looking at the sampling distribution of
the FDP in various scenarios when the median FDP is controlled; see their
Figure 1. Depending on the underlying dependence structure, this sampling
distribution can exhibit significant variation. As a result, the realized FDP
may well be quite above $\gamma = 0.1$.

A similar problem arises in controlling the false discovery rate (FDR), as 
proposed by [1]. The FDR is the expected value of the FDP. Like the median 
FDP, it is also a measure of central tendency of the sampling distribution 
of the FDP. In a given application, the realized FDP can be quite far away 
from its expected value, the FDR, as made clear in [15].
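To make the distinction between the three criteria concrete, the following minimal sketch computes, from a collection of realized FDP values, the empirical tail probability $P\{\mathrm{FDP} > \gamma\}$, the median FDP and the mean FDP (the empirical analogue of the FDR). The function name `fdp_summaries` and the ten input values are hypothetical and chosen only for illustration; they are not taken from the simulations reported here.

```python
from statistics import mean, median


def fdp_summaries(fdps, gamma):
    """Return (empirical P{FDP > gamma}, median FDP, mean FDP)."""
    tail = sum(1 for value in fdps if value > gamma) / len(fdps)
    return tail, median(fdps), mean(fdps)


# Hypothetical realized FDPs from ten repetitions of an experiment.
example = [0.0, 0.0, 0.05, 0.05, 0.08, 0.1, 0.1, 0.3, 0.4, 0.5]
tail, med, avg = fdp_summaries(example, gamma=0.1)
```

In this toy sample the median does not exceed 0.1, yet three of the ten realized values do, which is the kind of gap between a measure of central tendency and the realized FDP discussed above.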

Finally, some further simulations comparing the augmentation procedures 
of [34] and the procedures of [17] can be found in [8]. 

7. Concluding remarks. We have shown how computationally feasible 
step-down methods can be constructed to control generalized error rates 
in multiple testing. On the one hand, we have considered the $k$-FWER,
which is defined as the probability of making $k$ or more false rejections.
This concept would be appropriate when a given number of false rejections
can be tolerated. On the other hand, we have also considered the FDP,
which is the ratio of the number of false rejections to the total number of rejections
(and defined to be zero when there are no rejections). This concept would be
appropriate when a certain proportion of false rejections can be tolerated. 
Some simulations have shown that these less strict methods can reject many 
more false hypotheses compared to the traditional FWER control, especially 
when the number of hypotheses under test is large. 
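The two error measures summarized above can be written down directly. The sketch below assumes that hypotheses are identified by arbitrary hashable labels and that the set of true nulls is known (as it is in a simulation); the function names are ours and purely illustrative.

```python
def false_rejections(rejected, true_nulls):
    """Number of rejected hypotheses that are in fact true nulls."""
    return len(set(rejected) & set(true_nulls))


def fdp(rejected, true_nulls):
    """False discovery proportion, defined to be zero when nothing is rejected."""
    rejected = set(rejected)
    if not rejected:
        return 0.0
    return false_rejections(rejected, true_nulls) / len(rejected)


def has_k_false_rejections(rejected, true_nulls, k):
    """Indicator of the event whose probability is the k-FWER."""
    return false_rejections(rejected, true_nulls) >= k
```

Averaging `has_k_false_rejections` over simulation repetitions would estimate the $k$-FWER, and averaging the indicator `fdp(...) > gamma` would estimate $P\{\mathrm{FDP} > \gamma\}$.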

Our step-down methods (asymptotically) account for the dependence 
structure across test statistics. As a result, they are more powerful than 
the generalized Holm step-down methods of [14] and [17], which are based 
on individual p-values and designed to handle a "worst case" dependence 
structure. An alternative approach that also accounts for the dependence
structure across test statistics is the augmentation approach of [34]. However,
simulations show that the augmentation methods are noticeably less powerful, especially
when the number of hypotheses under test is large. The empirical Bayes
method of [33] can sometimes be more powerful than our bootstrap approach
for FDP control. However, it also can be quite liberal and it does not
offer asymptotic control of the FDP when all null hypotheses are true. Overall,
our methods for control of the $k$-FWER and FDP appear competitive
with or outperform currently available methods.

APPENDIX A: PROOFS 

Proof of Theorem 2.1. Assume $|I(P)| \ge k$, or there is nothing to
prove. Consider the event that at least $k$ true null hypotheses are rejected.
Let $\hat{j}$ be the (random) smallest index $j$ in the algorithm where this occurs, so
that $k\text{-}\max(T_{n,i} : i \in I(P)) > \hat{d}_{n,A_{\hat{j}}}(1-\alpha,k)$. By definition of $\hat{j}$ (now fixed),
$I(P) \subset A_{\hat{j}} \cup I_0$, where $I_0$ is some set of indices satisfying $I_0 \subset R_{\hat{j}}$ and $|I_0| = k - 1$.
Let $L$ be any set of indices of false null hypotheses (not necessarily
uniquely defined) which satisfy $A_{\hat{j}} \cup I_0 = I(P) \cup L$. Since $\hat{d}_{n,A_{\hat{j}}}(1-\alpha,k)$
is defined by taking the maximum over sets $I$ of $\hat{c}_{n,K}(1-\alpha,k)$ with
$K = A_{\hat{j}} \cup I$ as $I$ varies over indices satisfying $I \subset R_{\hat{j}}$ and $|I| = k - 1$, it follows
that $\hat{d}_{n,A_{\hat{j}}}(1-\alpha,k) \ge \hat{c}_{n,I(P) \cup L}(1-\alpha,k)$. By the monotonicity assumption,
$\hat{c}_{n,I(P) \cup L}(1-\alpha,k) \ge \hat{c}_{n,I(P)}(1-\alpha,k)$. To summarize, the event that at least
$k$ true null hypotheses are rejected implies that

$$k\text{-}\max(T_{n,i} : i \in I(P)) > \hat{c}_{n,I(P)}(1-\alpha,k)$$

and so (i) follows. Part (ii) follows immediately from (i). □
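The critical value used in the proof, a maximum of critical values over all enlargements of the not-yet-rejected set by $k - 1$ previously rejected indices, can be sketched as follows. This is only an illustration: `crit` stands for a hypothetical user-supplied function returning the critical value for an index set $K$, and the function name is ours.

```python
from itertools import combinations


def step_down_critical_value(crit, not_rejected, rejected, k):
    """Maximum of crit(K) over K = not_rejected | I, with I a (k - 1)-subset of rejected."""
    if len(rejected) < k - 1:
        raise ValueError("need at least k - 1 previously rejected hypotheses")
    base = frozenset(not_rejected)
    # For k = 1 the only subset is the empty one, so the result is crit(base).
    return max(
        crit(base | frozenset(extra))
        for extra in combinations(sorted(rejected), k - 1)
    )
```

Enumerating all subsets is feasible only when the number of previously rejected hypotheses or $k$ is small; the sketch makes no attempt to shortcut that search.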

Lemma A.1. Let $k \le s$. (i) The $k$-max function is continuous; that is, if
$y_n = (y_{n,1}, \ldots, y_{n,s}) \in \mathbb{R}^s$ and $y_n \to y$, then, as $n \to \infty$,
$k\text{-}\max(y_{n,1}, \ldots, y_{n,s}) \to k\text{-}\max(y_1, \ldots, y_s)$.

(ii) If $Y_n \in \mathbb{R}^s$ and $Y_n \xrightarrow{L} Y$, then $k\text{-}\max(Y_{n,1}, \ldots, Y_{n,s}) \xrightarrow{L} k\text{-}\max(Y_1, \ldots, Y_s)$.

(iii) Furthermore, if each $Y_i$ in (ii) has a continuous marginal distribution,
then the distribution of $k\text{-}\max(Y_1, \ldots, Y_s)$ is continuous.

Proof. Part (i) is trivial, and the continuous mapping theorem then
implies (ii). To prove (iii), note that, for any $x$, $P\{k\text{-}\max(Y_1, \ldots, Y_s) = x\} \le \sum_{i=1}^{s} P\{Y_i = x\} = 0$. □
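For concreteness, the $k$-max function of Lemma A.1 (the $k$th largest of its arguments) can be written as a short function. This is a minimal sketch with a name of our choosing.

```python
def k_max(values, k):
    """Return the k-th largest entry of `values`; k = 1 gives the maximum."""
    values = list(values)
    if not 1 <= k <= len(values):
        raise ValueError("k must satisfy 1 <= k <= len(values)")
    return sorted(values, reverse=True)[k - 1]
```

Continuity as in part (i) can be seen from the fact that perturbing every argument by at most $\varepsilon$ changes each order statistic, and hence the $k$-max, by at most $\varepsilon$.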

Proof of Theorem 3.1. To prove (i), by Corollary 3.1 it is sufficient
to show that

$$\limsup_{n} P\{k\text{-}\max(T_{n,i} : i \in I(P)) > b_{n,I(P)}(1-\alpha, k, \hat{Q}_n)\} \le \alpha. \tag{28}$$

Since $\theta_i(P) \le 0$ for $i \in I(P)$, it follows that

$$k\text{-}\max(T_{n,i} : i \in I(P)) = k\text{-}\max(\tau_n \hat{\theta}_{n,i} : i \in I(P)) \le k\text{-}\max(\tau_n[\hat{\theta}_{n,i} - \theta_i(P)] : i \in I(P)).$$

Therefore, the left-hand side of (28) is bounded above by

$$\limsup_{n} P\{k\text{-}\max(\tau_n[\hat{\theta}_{n,i} - \theta_i(P)] : i \in I(P)) > b_{n,I(P)}(1-\alpha, k, \hat{Q}_n)\}. \tag{29}$$

Assumptions B1 and B2 together with the continuous mapping theorem
imply that

$$\rho\bigl(L_{n,I(P)}(k, P),\, L_{n,I(P)}(k, \hat{Q}_n)\bigr) \xrightarrow{P} 0,$$

for any metric $\rho$ metrizing weak convergence on $\mathbb{R}$. Hence, it follows that
(29) is equal to $\alpha$, by an argument very similar to the proof of Theorem 1
of [3].

To prove (ii), assume $\theta_i(P) > 0$. Assumptions B1 and B2 together imply
that $b_{n,A_1}(1-\alpha, k, \hat{Q}_n)$ is stochastically bounded, where $A_1 = \{1, \ldots, s\}$.
Furthermore, by the continuous mapping theorem, $\tau_n[\hat{\theta}_{n,i} - \theta_i(P)]$ has a limiting
distribution, so $T_{n,i} = \tau_n \hat{\theta}_{n,i} \xrightarrow{P} \infty$. Therefore, with probability tending
to one, $T_{n,i} > b_{n,A_1}(1-\alpha, k, \hat{Q}_n)$, resulting in the rejection of $H_i$ in the first
step of Algorithm 2.1. □
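In practice a bootstrap critical value of the kind appearing in (28) would be approximated by an empirical quantile over resamples. The sketch below is one way this could look, under the assumption that `centered_draws[b][i]` holds the centered and scaled statistic for hypothesis `i` in bootstrap sample `b` (that is, $\tau_n$ times the difference between the bootstrap estimate and the original estimate); this data layout and the function names are our assumptions, not a specification from the text.

```python
import math


def k_max(values, k):
    """k-th largest entry of `values`."""
    return sorted(values, reverse=True)[k - 1]


def bootstrap_critical_value(centered_draws, indices, k, alpha):
    """Empirical (1 - alpha) quantile of the k-max over `indices` across draws."""
    stats = sorted(
        k_max([draw[i] for i in indices], k) for draw in centered_draws
    )
    # Smallest order statistic whose empirical c.d.f. value is >= 1 - alpha;
    # the tiny offset guards against floating-point round-up in the product.
    rank = math.ceil((1 - alpha) * len(stats) - 1e-12)
    return stats[max(rank, 1) - 1]
```

Restricting `indices` to a subset $K$ can only lower (or leave unchanged) the $k$-max in every draw, so the resulting critical values are monotone in $K$ by construction.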

Proof of Theorem 3.2. The proof is completely analogous to the
proof of Theorem 3.1. The only additional fact needed to prove (iii) is that,
when $\theta_i(P) > 0$, $\tau_n \hat{\theta}_{n,i} > 0$ with probability tending to one, and similarly for
$\theta_i(P) < 0$. Indeed, Assumption B1(i) implies $\tau_n[\hat{\theta}_{n,i} - \theta_i(P)]$ has a limiting
distribution, which implies $\tau_n \hat{\theta}_{n,i} \xrightarrow{P} \infty$ when $\theta_i(P) > 0$, and $\tau_n \hat{\theta}_{n,i} \xrightarrow{P} -\infty$
when $\theta_i(P) < 0$. □

Proof of Theorem 3.3. To prove (i), note that by reasoning similar
to before, $\min(T_{n,i} : i \notin I(P)) \xrightarrow{P} \infty$. On the other hand, $\max(T_{n,i} : i \in I(P))$
is either bounded in probability, in case $\theta_i(P) = 0$ for at least one $i \in I(P)$,
or $\max(T_{n,i} : i \in I(P)) \xrightarrow{P} -\infty$, in case $\theta_i(P) < 0$ for all $i \in I(P)$. Therefore,
the event

$$\min(T_{n,i} : i \notin I(P)) > \max(T_{n,i} : i \in I(P)) \tag{30}$$

has probability tending to 1. But if the event (30) happens, then the rejected
true hypotheses (if such exist) will always be the least significant
hypotheses among the rejected hypotheses at any stage. This together with
the monotonicity of the critical values $b_{n,K}(1-\alpha, k, \hat{Q}_n)$ allows us to deduce
asymptotic control of the $k$-FWER from (28) even when Algorithm 2.2 is
used. But (28) was already established in the proof of Theorem 3.1.

The proof of (ii) is identical to the proof of (ii) of Theorem 3.1. □

Proof of Theorem 3.4. The proof of (i) is the essential subsampling
argument, which derives from (24) being a $U$-statistic; see Theorem 2.6.1
of [21] where one statistic is treated, but the argument is extendable to the
simultaneous estimation of the joint distribution. The result (ii) follows as
well. To prove (iii), note that by Corollary 3.2 it is sufficient to show that

$$\limsup_{n} P\{k\text{-}\max(T_{n,i} : i \in I(P)) > \hat{c}_{n,I(P)}(1-\alpha, k)\} \le \alpha. \tag{31}$$

But part (ii) of the theorem implies, for any $\varepsilon > 0$,

$$\hat{c}_{n,I(P)}(1-\alpha, k) \ge c_{I(P)}(1-\alpha, k) - \varepsilon \quad \text{with probability} \to 1.$$

Therefore, using Assumption S, the limit superior of the probability of violation
of the $k$-FWER criterion is bounded above, for any $\varepsilon > 0$, by

$$\limsup_{n} k\text{-FWER}_P \le P\{k\text{-}\max(T_i, i \in I(P)) > c_{I(P)}(1-\alpha, k) - \varepsilon\},$$

where $(T_i, i \in I(P))$ denote variables whose joint distribution is $G_{I(P)}(P)$.
But letting $\varepsilon \to 0$, the right-hand side of the last expression becomes

$$1 - H_{I(P)}(c_{I(P)}(1-\alpha, k), P) = 1 - (1 - \alpha) = \alpha. \quad \Box$$

Proof of Theorem 4.1. To prove (i), note that by reasoning similar
to the proof of part (i) of Theorem 3.3, with probability tending to one,
all false hypotheses are rejected before any true hypothesis comes under
scrutiny. Therefore, with probability tending to 1, a violation of the FDP
criterion occurs if and only if the event

$$F > \frac{\gamma}{1-\gamma}\,(s - |I(P)|) \tag{32}$$

occurs, where $F$ is the number of true hypotheses rejected by Algorithm 4.1.
Let $F(k)$ denote the number of true hypotheses rejected by the bootstrap
$k$-FWER procedure. Furthermore, let $k^*$ denote the smallest integer greater
than $(\gamma/(1-\gamma))(s - |I(P)|)$. Assume $|I(P)| \ge k^*$ or there is nothing to prove.

By the above argument, we therefore have

$$\begin{aligned}
\limsup_{n} P\{\mathrm{FDP} > \gamma\} &= \limsup_{n} P\{F \ge k^*\} \\
&\le \limsup_{n} P\{F(k^*) \ge k^*\} \qquad (33) \\
&\le \alpha \qquad \text{[by part (i) of Theorem 3.1]}.
\end{aligned}$$

To see that (33) holds true, note the following two facts. First, the bootstrap
$k$-FWER procedure is monotone in $k$: any hypothesis rejected by the
$k_1$-FWER procedure will also be rejected by the $k_2$-FWER procedure as
long as $k_1 \le k_2$. Second, according to step 3(a) of Algorithm 4.1, the algorithm
terminates with the application of the $k^*$-FWER procedure, or even
before then, if

$$N_{k^*} < \frac{k^*}{\gamma} - 1. \tag{34}$$

In case all false hypotheses are rejected first, the event (34) happens if and
only if

$$k^* > \frac{\gamma}{1-\gamma}\,\bigl(s - |I(P)| + [F(k^*) - (k^* - 1)]\bigr). \tag{35}$$

By the definition of $k^*$, the inequality (35) will hold as long as $F(k^*) \le k^* - 1$.
Therefore, the event $F(k^*) \le k^* - 1$ implies that (1) $F(k) \le k^* - 1$ for any
$k \le k^*$; and that (2) Algorithm 4.1 terminates with the application of the
$k^*$-FWER procedure, or even before then, if all false hypotheses are rejected
first (which happens with probability tending to 1). These two facts together
demonstrate the validity of (33).
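The equivalence of (34) and (35) is a short piece of algebra, spelled out here for convenience. It uses only the fact that, when all false hypotheses are rejected first, the number of rejections of the $k^*$-FWER procedure is $N_{k^*} = s - |I(P)| + F(k^*)$.

```latex
\begin{aligned}
N_{k^*} < \frac{k^*}{\gamma} - 1
  &\iff \gamma\,\bigl(s - |I(P)| + F(k^*) + 1\bigr) < k^* \\
  &\iff (1-\gamma)\,k^* > \gamma\,\bigl(s - |I(P)| + F(k^*) + 1 - k^*\bigr) \\
  &\iff k^* > \frac{\gamma}{1-\gamma}\,\bigl(s - |I(P)| + [F(k^*) - (k^* - 1)]\bigr).
\end{aligned}
```

The second line subtracts $\gamma k^*$ from both sides. When $F(k^*) \le k^* - 1$ the bracketed term is nonpositive, so the right-hand side is at most $(\gamma/(1-\gamma))(s - |I(P)|)$, which is strictly less than $k^*$ by the definition of $k^*$.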

The proof of (ii) follows immediately from part (ii) of Theorem 3.1. □
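The quantities in this proof can be illustrated with a short sketch. It assumes, as we read the proof, that the FDP procedure applies the $k$-FWER procedure for $k = 1, 2, \ldots$ and stops as soon as the number of rejections $N_k$ satisfies $N_k < k/\gamma - 1$. The callback `k_fwer_procedure` is hypothetical (any function mapping $k$ to a list of rejected hypotheses), and passing `gamma` as a `Fraction` is our choice to keep the integer threshold exact.

```python
import math
from fractions import Fraction


def k_star(gamma, s, num_true_nulls):
    """Smallest integer strictly greater than gamma / (1 - gamma) * (s - |I(P)|)."""
    bound = gamma / (1 - gamma) * (s - num_true_nulls)
    return math.floor(bound) + 1


def fdp_control(k_fwer_procedure, gamma, s):
    """Increase k until the stopping rule N_k < k / gamma - 1 is met."""
    rejected = []
    for k in range(1, s + 1):
        rejected = k_fwer_procedure(k)
        if len(rejected) < k / gamma - 1:
            return k, rejected
    # Safeguard only (not part of the algorithm as described): stop at k = s.
    return s, rejected


# gamma = 0.1 with s = 100 hypotheses, 10 of which are true nulls.
threshold = k_star(Fraction(1, 10), 100, 10)
```

With these inputs $(\gamma/(1-\gamma))(s - |I(P)|)$ equals exactly 10, so the smallest integer strictly greater than it is 11; exact rational arithmetic avoids any rounding ambiguity at such boundary cases.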

APPENDIX B 

We briefly argue why the method in [33] does not provide even asymptotic 
control of the FDP when all null hypotheses are true. For this, assume 
there is one null hypothesis, so s = 1 (or m = 1 in the notation of [33]); the 
argument generalizes to arbitrary s. Control of the FDP when s = 1 reduces 
to control of the FWER, so the probability of rejecting a true null hypothesis
must be bounded above by a. 

Suppose $X_1, \ldots, X_n$ are i.i.d. $N(\theta, 1)$. Consider testing the null hypothesis
$H: \theta = 0$ against $\theta > 0$. Let $T_n = T_{n,1} = n^{-1/2} \sum_{i=1}^{n} X_i$. Let $\Phi(\cdot)$ denote the
c.d.f. of the standard normal distribution and $\phi(\cdot)$ its density.

Under $H$, $T_n \sim N(0, 1)$ and so $f_0$ (in the notation of [33]) is $\phi$. The algorithm
of [33] simplifies to the following:

1. If $T_n \le 0$, let $\hat{\pi} = 1$; otherwise, let $\hat{\pi} = \phi(T_n)/\phi(0)$.

2. Determine $\hat{c}$ as the solution to $\hat{\pi}(1 - \Phi(c)) = \alpha$. That is, $\hat{c} = \Phi^{-1}(1 - \alpha/\hat{\pi})$,
with $\Phi^{-1}(\lambda)$ defined as $-\infty$ for $\lambda \le 0$.

3. Reject $H$ if $T_n > \hat{c}$.

Now assume $\theta = 0$. Since $\Phi^{-1}(1 - \alpha/\hat{\pi}) < \Phi^{-1}(1 - \alpha)$ with positive probability,
$H$ is rejected with probability greater than $\alpha$. Moreover, $\hat{\pi}$ does
not even converge to 1 in probability [since $T_n$ has a nondegenerate $N(0, 1)$
distribution for every $n$, and asymptotically in typical nonparametric problems].
As an example, for $\alpha = 0.05$, a numerical simulation based on 100,000
repetitions results in a rejection probability of 0.107.
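A simulation of this kind can be sketched in a few lines. The code below is not the code behind the reported figure; it simply draws $T_n \sim N(0, 1)$ directly (which is its exact distribution under $\theta = 0$) and applies steps 1 to 3. The function names and the seed are our own choices.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: Phi is .cdf, phi is .pdf


def rejects(t_n, alpha):
    """Apply steps 1-3 of the simplified algorithm to one realization of T_n."""
    # Step 1: pi_hat = 1 for T_n <= 0, otherwise phi(T_n) / phi(0).
    if t_n <= 0:
        pi_hat = 1.0
    else:
        pi_hat = STD_NORMAL.pdf(t_n) / STD_NORMAL.pdf(0.0)
    # Step 2: c_hat = Phi^{-1}(1 - alpha / pi_hat), set to -infinity when the
    # argument is <= 0 (this also covers a numerically zero pi_hat).
    lam = 1.0 - alpha / pi_hat if pi_hat > 0.0 else -math.inf
    c_hat = STD_NORMAL.inv_cdf(lam) if lam > 0.0 else -math.inf
    # Step 3: reject H when T_n exceeds the critical value.
    return t_n > c_hat


def rejection_rate(repetitions, alpha, seed=0):
    """Monte Carlo estimate of P{reject H} when theta = 0, so T_n ~ N(0, 1)."""
    rng = random.Random(seed)
    hits = sum(rejects(rng.gauss(0.0, 1.0), alpha) for _ in range(repetitions))
    return hits / repetitions
```

Calling `rejection_rate(100_000, 0.05)` should give a value well above the nominal level 0.05 and in the neighborhood of the figure quoted above, up to Monte Carlo error.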
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